Computational mechanics, an approach to structural complexity,
defines a process's causal states and gives a procedure for
finding them. We show that the causal-state representation, an
ε-machine, is the minimal one consistent with accurate
prediction. We establish several results on ε-machine optimality
and uniqueness and on how ε-machines compare to alternative
representations. Further results relate measures of randomness
and structural complexity obtained from ε-machines to those from
ergodic and information theories.

I. INTRODUCTION

Organized matter is ubiquitous in the natural world, 
but the branch of physics which ought to handle it — 
statistical mechanics — lacks a coherent, principled way of 
describing, quantifying, and detecting the many different 
kinds of structure nature exhibits. Statistical mechan- 
ics has good measures of disorder in thermodynamic en- 
tropy and in related quantities, such as the free energies. 
When augmented with theories of critical phenomena [1]
and pattern formation [2], it also has an extremely suc-
cessful approach to analyzing patterns formed through
symmetry breaking, both in equilibrium [3] and, more
recently, outside it [4]. Unfortunately, these successes
involve many ad hoc procedures — such as guessing rele- 
vant order parameters, identifying small parameters for 
perturbation expansion, and choosing appropriate func- 
tion bases for spatial decomposition. It is far from clear 
that the present methods can be extended to handle all 
the many kinds of organization encountered in nature, 
especially those produced by biological processes. 



Computational mechanics [5] is an approach that lets
us directly address the issues of pattern, structure, and
organization. While keeping concepts and mathematical
tools already familiar from statistical mechanics, it
is distinct from the latter and complementary to it. In
essence, from either empirical data or a probabilistic
description of behavior, it shows how to infer a model of
the hidden process that generated the observed behavior.
This representation, the ε-machine, captures the
patterns and regularities in the observations in a way
that reflects the causal structure of the process. Usefully,
with this model in hand, one can extrapolate beyond
the original observational data to make predictions
of future behavior. Moreover, in a well defined sense
that is the subject of the following, the ε-machine is the
unique maximally efficient model of the observed
data-generating process.

ε-Machines themselves reveal, in a very direct way,
how information is stored in the process, and how that
stored information is transformed by new inputs and by
the passage of time. This, and not using computers for
simulations and numerical calculations, is what makes
computational mechanics "computational", in the sense
of "computation theoretic".

The basic ideas of computational mechanics were introduced
a decade ago [6]. Since then they have been used
to analyze dynamical systems [7,8], cellular automata
[9], hidden Markov models [10], evolved spatial computation
[11], stochastic resonance [12], globally coupled
maps [13], and the dripping faucet experiment [14]. Despite
this record of successful application, there has been
some uncertainty about the mathematical foundations of
the subject. In particular, while it seemed evident from
construction that an ε-machine captured the patterns
inherent in a process and did so in a minimal way, no
explicit proof of this was published. Moreover, there was
no proof that, if the ε-machine was optimal in this way,
it was the unique optimal representation of a process.
These little-needed gaps have now been filled. Subject
to some (reasonable) restrictions on the statistical character
of a process, we prove that the ε-machine is indeed
the unique optimal causal model. The rigorous proof of
these results is the main burden of this paper. We gave
preliminary versions of the optimality results (but not
the uniqueness theorem, which is new here) in Ref. [15].

The outline of the exposition is as follows. We be- 
gin by showing how computational mechanics relates to 
other approaches to pattern, randomness, and causality. 
The upshot of this is to focus our attention on patterns 
within a statistical ensemble and their possible represen- 
tations. Using ideas from information theory, we state a 
quantitative version of Occam's Razor for such representations.
At that point we define causal states [6], equivalence
classes of behaviors, and the structure of transitions
between causal states: the ε-machine. We then show
that the causal states are ideal from the point of view of
Occam's Razor, being the simplest way of attaining the
maximum possible predictive power. Moreover, we show
that the causal states are uniquely optimal. This combination
allows us to prove a number of other, related
optimality results about ε-machines. We examine the assumptions
made in deriving these optimality results, and
we note that several of them can be lifted without unduly
upsetting the theorems. We also establish bounds on a
process's intrinsic computation as revealed by ε-machines
and by quantities in information and ergodic theories. 
Finally, we close by reviewing what has been shown and 
what seem like promising directions for further work on 
the mathematical foundations of computational mechan- 
ics. 

A series of appendices provides supplemental material
on information theory, equivalence relations and classes,
ε-machines for time-reversed processes, semi-group theory,
and connections and distinctions between computational
mechanics and other fields.

To set the stage for the mathematics to follow and to
motivate the assumptions used there, we begin now by
reviewing prior work on pattern, randomness, and causality.
We urge the reader interested only in the mathematical
development to skip directly to Sec. II F, a synopsis
of the central assumptions of computational mechanics,
and continue from there.



II. PATTERNS 

To introduce our approach to — and even to argue that 
some approach is necessary for — discovering and describ- 
ing patterns in nature we begin by quoting Jorge Luis 
Borges: 

These ambiguities, redundancies, and de- 
ficiencies recall those attributed by Dr. Franz 
Kuhn to a certain Chinese encyclopedia 
entitled Celestial Emporium of Benevolent 
Knowledge. On those remote pages it is writ- 
ten that animals are divided into (a) those 
that belong to the Emperor, (b) embalmed 
ones, (c) those that are trained, (d) suck- 
ling pigs, (e) mermaids, (f) fabulous ones, (g) 
stray dogs, (h) those that are included in this 
classification, (i) those that tremble as if they 
were mad, (j) innumerable ones, (k) those 
drawn with a very fine camel's hair brush, 
(1) others, (m) those that have just broken a 
flower vase, (n) those that resemble flies from 
a distance. 

J. L. Borges, "The Analytical Language of
John Wilkins", in Ref. [16, p. 103]; see also
discussion in Ref. [17].

The passage illustrates the profound gulf between pat- 
terns, and classifications derived from patterns, that are 
appropriate to the world and help us to understand it 
and those patterns which, while perhaps just as legitimate
as prosaic regularities, are not at all informative.
What makes the Celestial Emporium's scheme inherently
unsatisfactory, and not just strange, is that it tells us
nothing about animals. We want to find patterns in a
process that "divide it at the joints, as nature directs,
not breaking any limbs in half as a bad carver might"
[18, Sec. 265D].

Computational mechanics is not directly concerned
with pattern formation per se, though we suspect it
will ultimately be useful in that domain. Nor is it concerned
with pattern recognition as a practical matter as
found in, say, neuropsychology [19], psychophysics [20],
cognitive ethology [21], computer engineering [22], and
signal and image processing [23,24]. Instead, it is concerned
with the questions of what patterns are and how
patterns should be represented. One way to highlight the 
difference is to call this pattern discovery, rather than 
pattern recognition. 

The bulk of the intellectual discourse on what patterns 
are has been philosophical. One distinct subset has been 
conducted under the broad rubric of mathematical logic. 
Within this there are approaches, on the one hand, that 
draw on (highly) abstract algebra and the theory of rela- 
tions; on the other, that approach patterns via the theory 
of algorithms and effective procedures. 

The general idea, in both approaches, is that some object
$\mathcal{O}$ has a pattern $\mathcal{P}$ (that is, $\mathcal{O}$ has a
pattern "represented", "described", "captured", and so on by
$\mathcal{P}$) if and only if we can use $\mathcal{P}$ to predict or
compress $\mathcal{O}$. Note that the ability to predict implies the
ability to compress, but not vice versa; here we stick to
prediction. The algebraic and algorithmic strands differ mainly
on how $\mathcal{P}$ itself should be represented; that is, they
differ in how it is expressed in the vocabulary of some formal scheme.

We should emphasize here that "pattern" in this sense 
implies a kind of regularity, structure, symmetry, orga- 
nization, and so on. In contrast, ordinary usage some- 
times accepts, for example, speaking about the "pattern" 
of pixels in a particular slice of between-channels video 
"snow" ; but we prefer to speak of that as the configura- 
tion of pixels. 



A. Algebraic Patterns 

Although the problem of pattern discovery appears
early, in Plato's Meno [25] for example, perhaps the first
attempt to make the notion of "pattern" mathematically
rigorous was that of Whitehead and Russell in Principia
Mathematica. They viewed pattern as a property, not
of sets, but of relations within or between sets, and accordingly
they worked out an elaborate relation-arithmetic
[26, vol. II, part IV]; cf. [27, ch. 5-6]. This starts by
defining the relation-number of a relation between two
sets as the class of all the relations that are equivalent
to it under one-to-one, onto mappings of the two sets.
In this framework relations share a common pattern or
structure if they have the same relation-number. For
instance, all square lattices have similar structure since
their elements share the same neighborhood relation; as
do all hexagonal lattices. Hexagonal and square lattices,
however, exhibit different patterns since they have
non-isomorphic neighborhood relations, i.e., since they have
different relation-numbers. (See also recoding equivalence
defined in Ref. [28].) Less work has been done on this
than they, especially Russell [29], had hoped. This may
be due in part to a general lack of familiarity with Volume
II of Ref. [26].

A more recent attempt at developing an algebraic ap- 
proach to patterns builds on semi-group theory and its 
Krohn-Rhodes decomposition theorem. Ref. [30] discusses
a range of applications of this approach to patterns.
Along these lines, Rhodes and Nehaniv have tried
to apply semi-group complexity theory to biological
evolution [31]. They suggest that the complexity of a
biological structure can be measured by the number of
subgroups in the decomposition of an automaton that 
describes the structure. 

Yet another algebraic approach has been developed by 
Grenander and co-workers, primarily for pattern recognition
[32]. Essentially, this is a matter of trying to invent
a minimal set of generators and bonds for the pattern
in question. Generators can adjoin each other, in a suitable
n-dimensional space, only if their bonds are compatible.
Each pair of compatible bonds at once specifies a
binary algebraic operation and an observable element of
the configuration built out of the generators. (Our construction
in App. D, linking an algebraic operation with
concatenations of strings, is analogous in a rough way.)
Probabilities can be attached to these bonds, leading in a
natural way to a (Gibbsian) probability distribution over
entire configurations. Grenander and his colleagues have
used these methods to characterize, inter alia, several
biological phenomena [33,34].



B. Turing Mechanics: Patterns and Effective 
Procedures 



The other path to patterns follows the traditional exploration
of the logical foundations of mathematics, as articulated
by Frege and Hilbert and pioneered by Church,
Gödel, Post, Russell, Turing, and Whitehead. A more
recent and relatively more popular approach goes back
to Kolmogorov and Chaitin, who were interested in the
exact reproduction of an individual object [35-38]; in particular,
their focus was discrete symbol systems, rather
than (say) real numbers or other mathematical objects.
The candidates for expressing the pattern $\mathcal{P}$ were
universal Turing machine (UTM) programs; specifically, the
shortest UTM program that can exactly produce the object
$\mathcal{O}$. This program's length is called $\mathcal{O}$'s
Kolmogorov-Chaitin complexity. Note that any scheme (automaton,
grammar, or what-not) that is Turing equivalent and for
which a notion of "length" is well defined will do as a
representational scheme. Since we can convert from one
such device to another (say, from a Post tag system [39]
to a Turing machine) with only a finite description of the
first system, such constants are easily assimilated when
measuring complexity in this approach.

In particular, consider the first n symbols $\mathcal{O}_n$ of $\mathcal{O}$
and the shortest program $\mathcal{P}_n$ that produces them. We ask,
What happens to the limit

\[ \lim_{n \to \infty} \frac{|\mathcal{P}_n|}{n} , \qquad (1) \]

where $|\mathcal{P}|$ is the length in bits of program $\mathcal{P}$? On the one
hand, if there is a fixed-length program $\mathcal{P}$ that generates
arbitrarily many digits of $\mathcal{O}$, then this limit vanishes.
Most of our interesting numbers, rational or irrational
(such as $\pi$, $e$, $\sqrt{2}$), are of this sort. These numbers are
eminently compressible: the program $\mathcal{P}$ is the compressed
description, and so it captures the pattern obeyed by the
sequence describing $\mathcal{O}$. If the limit goes to 1, on the other
hand, we have a completely incompressible description
and conclude, following Kolmogorov, Chaitin, and others,
that $\mathcal{O}$ is random [35,38,40,41]. This conclusion is
the desired one: the Kolmogorov-Chaitin framework establishes,
formally at least, the randomness of an individual
object without appeals to probabilistic descriptions
or to ensembles of reproducible events. And it does so by
referring to a deterministic, algorithmic representation:
the UTM.
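The two regimes of the limit in Eq. (1) can be illustrated numerically, with an important caveat: the shortest-program length is uncomputable, so the sketch below substitutes the output length of a general-purpose compressor (Python's `zlib`), which is only a crude, computable stand-in and not the Kolmogorov-Chaitin complexity itself. The sample strings and the helper names `compression_ratio` and `ratios` are illustrative assumptions, not part of the original text.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed length divided by original length (both in bytes).

    Assumption: zlib output length is used as a rough proxy for the length
    of a short description; it is NOT the true shortest-program length.
    """
    return len(zlib.compress(data, 9)) / len(data)


def ratios(sample: bytes, lengths):
    """Proxy for |P_n| / n evaluated on growing prefixes of `sample`."""
    return [compression_ratio(sample[:n]) for n in lengths]


lengths = [1_000, 10_000, 100_000]
periodic = b"01" * 50_000      # highly regular: a fixed short rule generates it
noisy = os.urandom(100_000)    # OS-supplied randomness: no short rule expected

periodic_ratios = ratios(periodic, lengths)  # shrinks toward 0 as n grows
noisy_ratios = ratios(noisy, lengths)        # stays close to 1

print(periodic_ratios)
print(noisy_ratios)
```

For the periodic string the ratio keeps falling as the prefix grows, mirroring a vanishing limit; for the random bytes it stays near 1, mirroring incompressibility.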

There are many well-known difficulties with applying 
Kolmogorov complexity to natural processes. First, as 
a quantity, it is uncomputable in general, owing to the 
halting problem [38]. Second, it is maximal for random
sequences; this can be construed either as desirable, as 
just noted, or as a failure to capture structure, depending 
on one's aims. Third, it only applies to a single sequence; 
again this is either good or bad. Fourth, it makes no al- 
lowance for noise or error, demanding exact reproduction. 
Finally, $\lim_{n \to \infty} |\mathcal{P}_n|/n$ can vanish, although the
computational resources needed to run the program, such as
time and storage, grow without bound. 

None of these impediments have kept researchers from
attempting to use Kolmogorov-Chaitin complexity for
practical tasks, such as measuring the complexity of natural
objects (e.g. Ref. [42]), as a basis for theories of
inductive inference [43,44], and generally as a means of
capturing patterns [49]. As Rissanen [46, p. 49] says,
this is akin to "learn[ing] the properties [of a data set] by
writing programs in the hope of finding short ones!"

Various of the difficulties just listed have been addressed
by subsequent work. Bennett's logical depth accounts
for time resources [47]. (In fact, it is the time
for the minimal-length program $\mathcal{P}$ to produce $\mathcal{O}$.)
Koppel's sophistication attempts to separate out the
"regularity" portion of the program from the random or
instance-specific input data [48,49]. Ultimately, these
extensions and generalizations remain in the UTM,
exact-reproduction setting and so inherit inherent
uncomputability.

C. Patterns with Error

Motivated by these theoretical difficulties and practical
concerns, an obvious next step is to allow our pattern
$\mathcal{P}$ some degree of approximation or error, in exchange
for shorter descriptions. As a result, we lose perfect
reproduction of the original configuration from the pattern.
Given the ubiquity of noise in nature, this is a small price
to pay. We might also say that sometimes we are willing
to accept small deviations from a regularity, without really
caring what the precise deviation is. As pointed out
in the conclusion of Ref. [17], this is certainly a prime
motivation in thermodynamic descriptions, in which we
explicitly throw away, and have no interest in, vast amounts of
microscopic detail in order to find a workable description
of macroscopic observations.

Some interesting philosophical work on patterns-with- 
error has been done by Dennett, with reference not just 
to questions about the nature of patterns and their emer-
gence but also to psychology [50]. The intuition is that
truly random processes can be modeled very simply — "to 
model coin-tossing, toss a coin." Any prediction scheme 
that is more accurate than assuming complete indepen- 
dence ipso facto captures a pattern in the data. There 
is thus a spectrum of potential pattern-capturers ranging 
from the assumption of pure noise to the exact reproduc- 
tion of the data, if that is possible. Dennett notes that 
there is generally a trade-off between the simplicity of 
a predictor and its accuracy, and he plausibly describes 
emergent phenomena [51,52] as patterns that allow for
a large reduction in complexity for only a small reduc-
tion in accuracy. Of course, Dennett was by no means
the first to consider predictive schemes that tolerate error
and noise; we discuss some of the earlier work in App. G.
However, to our knowledge, he was the first to have made 
such predictors a central part of an explicit account of 
what patterns are. It must be noted that this account 
lacks the mathematical detail of the other approaches we 
have considered so far, and that it relies on the inexact 
prediction of a single configuration. In fact, it relies on 
exact predictors that are "fuzzed up" by noise. The in- 
troduction of noise, however, brings in probabilities, and 
their natural setting is in ensembles. It is in that setting 
that the ideas we share with Dennett can receive a proper 
quantitative treatment. 



D. Randomness: The Anti-Pattern? 

We should at this point say a bit about the relations
between randomness, complexity, and structure, at least
as we use those words. Ignoring some foundational issues,
randomness is actually rather well understood and well
handled by classical tools introduced by Boltzmann [53];
Fisher, Neyman, and Pearson [54]; Kolmogorov [35]; and
Shannon [55], among others. One tradition in the study
of complexity in fact identifies complexity with randomness
and, as we have just seen, this is useful for some
purposes. As these purposes are not those of analyzing
patterns in processes and in real-world data, however,
they are not ours. Randomness simply does not correspond
to a notion of pattern or structure at all and, by
implication, neither Kolmogorov-Chaitin complexity nor
any of its spawn measure pattern.

Nonetheless, some approaches to complexity conflate
"structure" with the opposite of randomness, as conventionally
understood and measured in physics by thermodynamic
entropy or a related quantity, such as Shannon
entropy. In effect, structure is defined as "one minus disorder".
In contrast, we see pattern (structure, organization,
regularity, and so on) as describing a coordinate
"orthogonal" to a process's degree of randomness. That
is, complexity (in our sense) and randomness each capture
a useful property necessary to describe how a process
manipulates information. This complementarity is even
codified by the complexity-entropy diagrams introduced
in Ref. [6]. It should be clear now that when we use the
word "complexity" we mean "degrees" of pattern, not
degrees of randomness.



E. Causation 

We want our representations of patterns in dynamical
processes to be causal: to say how one state of affairs
leads to or produces another. Although a key property,
causality enters our development only in an extremely
weak sense, the weakest one can use mathematically,
which is Hume's [56]: one class of event causes another
if the latter always follows the former; the effect invariably
succeeds the cause. As good indeterminists, in the
following we replace this invariant-succession notion of
causality with a more probabilistic one, substituting a
homogeneous distribution of successors for the solitary
invariable successor. (A precise statement appears in
Sec. IV A's definition of causal states.) This approach
results in a purely phenomenological statement of causality,
and so it is amenable to experimentation in ways that
stronger notions of causality (e.g., that of Ref. [57]) are
not. Ref. [58] independently reaches a concept of causality
essentially the same as ours via philosophical arguments.



F. Synopsis of Pattern 

In line with these observations, the ideal, synthesizing 
approach to patterns would be at once: 

1. Algebraic, giving us an explicit breakdown or de- 
composition of the pattern into its parts; 

2. Computational, showing how the process stores and 
uses information; 

3. Calculable, analytically or by systematic approxi- 
mation; 

4. Causal, telling us how instances of the pattern are
actually produced; and

5. Naturally stochastic, not merely tolerant of noise 
but explicitly formulated in terms of ensembles. 

This mix is precisely the brew we claim, in all modesty, 
to have on tap. 



III. PATTERNS IN ENSEMBLES: 
PADDLING AROUND OCCAM'S POOL 

Here a pattern $\mathcal{P}$ is something knowledge of which lets
us predict, at better than chance rates, if possible, the
future of sequences drawn from an ensemble $\mathcal{O}$: $\mathcal{P}$ has
to be statistically accurate and confer some leverage or
advantage as well. Let's fix some notation and state the
assumptions that will later let us prove the basic results.



A. Hidden Processes 

We restrict ourselves to discrete-valued, discrete-time
stationary stochastic processes. (See Sec. VII B for discussion
of these assumptions.) Intuitively, such processes
are sequences of random variables $S_i$, the values of which
are drawn from a countable set $\mathcal{A}$. We let i range over
all the integers, and so get a bi-infinite sequence

\[ \overleftrightarrow{S} = \ldots S_{-1} S_0 S_1 \ldots \qquad (2) \]

In fact, we define a process in terms of the distribution
of such sequences; cf. Ref. [59].

Definition 1 (A Process) Let $\mathcal{A}$ be a countable set.
Let $\Omega = \mathcal{A}^{\mathbb{Z}}$ be the set of bi-infinite sequences composed
from $\mathcal{A}$, $T_i : \Omega \mapsto \mathcal{A}$ be the function that returns the
$i$th element $s_i$ of a bi-infinite sequence $\omega \in \Omega$, and $\mathcal{F}$ the
field of cylinder sets of $\Omega$. Adding a probability measure P
gives us a probability space $(\Omega, \mathcal{F}, P)$, with an associated
random variable $\overleftrightarrow{S}$. A process is a sequence of random
variables $S_i = T_i(\overleftrightarrow{S})$, $i \in \mathbb{Z}$.

Here, and throughout, we follow the convention of using 
capital letters to denote random variables and lower-case 
letters their particular values. 

It follows from Def. 1 that there are well defined probability
distributions for sequences of every finite length.
Let $\overrightarrow{S}_t^L$ be the sequence $S_t, S_{t+1}, \ldots, S_{t+L-1}$ of L random
variables beginning at $S_t$. $\overrightarrow{S}_t^0 = \lambda$, the null sequence.
Likewise, $\overleftarrow{S}_t^L$ denotes the sequence of L random variables
going up to $S_t$, but not including it; $\overleftarrow{S}_t^L = \overrightarrow{S}_{t-L}^L$. Both
$\overrightarrow{S}_t^L$ and $\overleftarrow{S}_t^L$ take values $s^L \in \mathcal{A}^L$. Similarly, $\overrightarrow{S}_t$
and $\overleftarrow{S}_t$ are the semi-infinite sequences starting from and
stopping at t and taking values $\overrightarrow{s}$ and $\overleftarrow{s}$, respectively.

Intuitively, we can imagine starting with distributions 
for finite-length sequences and extending them gradu- 
ally in both directions, until the infinite sequence is 
reached as a limit. While this can be a useful picture 
to have in mind, defining a process in this way raises 
some subtle measure-theoretic issues, such as how finite- 
dimensional distributions limit on an infinite-dimensional 
one [60, ch. 7]. To avoid these we start with the infinite-
dimensional distribution.

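Definition 2 (Stationarity) A process $S_i$ is stationary
if and only if

\[ P(\overrightarrow{S}_t^L = s^L) = P(\overrightarrow{S}_0^L = s^L) , \qquad (3) \]

for all $t \in \mathbb{Z}$, $L \in \mathbb{Z}^+$, and all $s^L \in \mathcal{A}^L$.

In other words, a stationary process is one that is
time-translation invariant. Consequently,
$P(\overrightarrow{S}_t = \overrightarrow{s}) = P(\overrightarrow{S}_0 = \overrightarrow{s})$ and
$P(\overleftarrow{S}_t = \overleftarrow{s}) = P(\overleftarrow{S}_0 = \overleftarrow{s})$, and so we drop
the subscripts from now on.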
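Time-translation invariance can be checked directly on a small example. The following is a minimal sketch, assuming a hypothetical two-symbol Markov chain (the transition table `T` and distribution `pi` are illustrative choices, not taken from the text) and a one-sided chain started at time 0 rather than a genuinely bi-infinite process. It compares word probabilities at time t with those at time 0 for every word up to a chosen length.

```python
from itertools import product

# Hypothetical two-symbol Markov chain, chosen only for illustration.
# T[a][b] = P(next symbol = b | current symbol = a)
T = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}
pi = {0: 2 / 3, 1: 1 / 3}  # stationary distribution of T (pi T = pi)


def marginal_at(t, start):
    """Distribution of S_t when S_0 is drawn from `start`."""
    dist = dict(start)
    for _ in range(t):
        dist = {b: sum(dist[a] * T[a][b] for a in T) for b in T}
    return dist


def word_probability(word, t, start):
    """P(S_t ... S_{t+L-1} = word) for the chain started from `start` at time 0."""
    p = marginal_at(t, start)[word[0]]
    for a, b in zip(word, word[1:]):
        p *= T[a][b]
    return p


def is_time_translation_invariant(start, max_t=5, max_len=3, tol=1e-12):
    """Check the equality in Eq. (3) for all words up to max_len and shifts up to max_t."""
    for L in range(1, max_len + 1):
        for word in product(T, repeat=L):
            p0 = word_probability(word, 0, start)
            if any(abs(word_probability(word, t, start) - p0) > tol
                   for t in range(1, max_t + 1)):
                return False
    return True


print(is_time_translation_invariant(pi))                # started in pi
print(is_time_translation_invariant({0: 1.0, 1: 0.0}))  # started deterministically at 0
```

Starting the chain in its stationary distribution passes the check; starting it deterministically in symbol 0 fails, because the one-symbol marginal already changes between t = 0 and t = 1.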

B. The Pool 

Our goal is to predict all or part of $\overrightarrow{S}$ using some function
of some part of $\overleftarrow{S}$. We begin by taking the set $\overleftarrow{\mathbf{S}}$ of
all pasts and partitioning it into mutually exclusive and
jointly comprehensive subsets. That is, we make a class
$\boldsymbol{\mathcal{R}}$ of subsets of pasts [footnote 1]. (See Fig. 1 for a schematic
example.) Each $\rho \in \boldsymbol{\mathcal{R}}$ will be called a state or an effective
state. When the current history $\overleftarrow{s}$ is included in the set
$\rho$, we will speak of the process being in state $\rho$. Thus, we
define a function from histories to effective states:

\[ \eta : \overleftarrow{\mathbf{S}} \mapsto \boldsymbol{\mathcal{R}} . \qquad (4) \]

A specific individual history $\overleftarrow{s} \in \overleftarrow{\mathbf{S}}$ maps to a specific
state $\rho \in \boldsymbol{\mathcal{R}}$; the random variable $\overleftarrow{S}$ for the past maps to
the random variable $\mathcal{R}$ for the effective states. It makes
little difference whether we think of $\eta$ as being a function
from a history to a subset of histories or a function from
a history to the label of that subset. Each interpretation
is convenient at different times, and we will use both.

Note that we could use any function defined on $\overleftarrow{\mathbf{S}}$ to
partition that set, by assigning to the same $\rho$ all the histories
$\overleftarrow{s}$ on which the function takes the same value. Similarly,
any equivalence relation on $\overleftarrow{\mathbf{S}}$ partitions it. (See
App. B for more on equivalence relations.) Due to the
way we defined a process's distribution, each effective
state has a well defined distribution of futures, though
not necessarily a unique one [footnote 2]. Specifying the effective
state thus amounts to making a prediction about the
process's future. All the histories belonging to a given effective
state are treated as equivalent for purposes of predicting
the future. (In this way, the framework formally
incorporates traditional methods of time-series analysis;
see App. G 1.)

[Footnote 1] At several points our constructions require referring
to sets of sets. To help mark the distinction, we call the set of
sets of histories a class.
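The remark that any function on histories induces a partition into effective states can be made concrete. The sketch below is a minimal illustration under simplifying assumptions: semi-infinite pasts are replaced by all binary histories of length 3, and the two choices of the map (group by last symbol, group by number of ones) are hypothetical examples rather than anything prescribed by the text. The helper name `partition_by` is likewise an assumption.

```python
from collections import defaultdict
from itertools import product


def partition_by(histories, eta):
    """Group histories into effective states: one cell per distinct value of eta."""
    cells = defaultdict(set)
    for h in histories:
        cells[eta(h)].add(h)
    return dict(cells)


# Finite stand-in for the set of pasts: all binary histories of length 3,
# written oldest symbol first. (Real pasts are semi-infinite.)
histories = ["".join(bits) for bits in product("01", repeat=3)]

# Two hypothetical choices of eta; each yields a different class of effective states.
by_last_symbol = partition_by(histories, lambda h: h[-1])
by_number_of_ones = partition_by(histories, lambda h: h.count("1"))

for label, cell in sorted(by_last_symbol.items()):
    print(label, sorted(cell))
```

In both cases the cells are mutually exclusive and jointly cover the set of histories, which is all that is required of a class of effective states; how well each class predicts the future is a separate question.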



FIG. 1. A schematic picture of a partition of the
set $\overleftarrow{\mathbf{S}}$ of all histories into some class of effective states:
$\boldsymbol{\mathcal{R}} = \{\mathcal{R}_i : i = 1, 2, 3, 4\}$. Note that the $\mathcal{R}_i$ need not form
compact sets; we simply draw them that way for clarity. One
should have in mind Cantor sets or other more pathological
structures.

We call the collection of all partitions $\boldsymbol{\mathcal{R}}$ of the set of
histories $\overleftarrow{\mathbf{S}}$ Occam's pool.



C. A Little Information Theory

Since the bulk of the following development will be consumed
with notions and results from information theory
[55], we now review several highlights briefly, for the benefit
of readers unfamiliar with the theory and to fix notation.
Appendix A lists a number of useful information-theoretic
formulae, which get called upon in our proofs.
Throughout, our notation and style of proof follow those
in Ref. [62].



1. Entropy Defined

Given a random variable X taking values in a countable
set $\mathcal{A}$, the entropy of X is

\[ H[X] = -\sum_{x \in \mathcal{A}} P(X = x) \log_2 P(X = x) , \qquad (5) \]

taking $0 \log 0 = 0$. Notice that H[X] is the expectation
value of $-\log_2 P(X = x)$ and is measured in bits of information.
Caveats of the form "when the sum converges
to a finite value" are implicit in all statements about the
entropies of infinite countable sets $\mathcal{A}$.

Shannon interpreted H[X] as the uncertainty in X. 
(Those leery of any subjective component in notions 
like "uncertainty" may read "effective variability" in its 
place.) He showed, for example, that H[X] is the mean 
number of yes-or-no questions needed to pick out the 
value of X on repeated trials, if the questions are chosen 
to minimize this average [55].
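As a small worked illustration of the entropy definition and of the yes-or-no-question reading, the sketch below computes H[X] in bits for a few hypothetical distributions (the coin and four-outcome examples are assumptions introduced here). Zero-probability outcomes are skipped, which implements the convention 0 log 0 = 0.

```python
from math import log2


def entropy(dist):
    """Shannon entropy H[X] in bits of a distribution {value: probability}."""
    # Outcomes with zero probability are skipped, implementing 0 log 0 = 0.
    return -sum(p * log2(p) for p in dist.values() if p > 0)


fair_coin = {"H": 0.5, "T": 0.5}
biased_coin = {"H": 0.9, "T": 0.1}
certain = {"H": 1.0, "T": 0.0}
four_way = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

print(entropy(fair_coin))    # 1.0
print(entropy(four_way))     # 1.75
print(entropy(biased_coin))
print(entropy(certain))
```

The four-outcome case matches the question-counting picture: asking "is it a?", then "is it b?", then "is it c?" takes 1, 2, or 3 questions with probabilities 1/2, 1/4, and 1/4, for an average of 1.75 questions.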



2. Joint and Conditional Entropies 

We define the joint entropy H[X,Y] of two variables
X (taking values in $\mathcal{A}$) and Y (taking values in $\mathcal{B}$) in the
obvious way,

\[ H[X,Y] = -\sum_{(x,y) \in \mathcal{A} \times \mathcal{B}} P(X = x, Y = y) \log_2 P(X = x, Y = y) . \qquad (6) \]

We define the conditional entropy H[X|Y] of one random
variable X with respect to another Y from their joint
entropy:

\[ H[X|Y] = H[X,Y] - H[Y] . \qquad (7) \]

This also follows naturally from the definition of conditional
probability, since P(X = x|Y = y) = P(X = x, Y = y)/P(Y = y).
H[X|Y] measures the mean uncertainty
remaining in X once we know Y.



3. Mutual Information 

The mutual information I[X;Y] between two variables
is defined to be

\[ I[X;Y] = H[X] - H[X|Y] . \qquad (8) \]

This is the average reduction in uncertainty about X
produced by fixing Y. It is non-negative, like all entropies
here, and symmetric in the two variables.
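The relations among joint entropy, conditional entropy, and mutual information can be checked on small joint distributions. This is a minimal sketch; the three joint tables and the helper names (`marginal`, `conditional_entropy`, `mutual_information`) are hypothetical illustrations, and the code simply applies Eqs. (7) and (8) as stated.

```python
from math import log2


def entropy(dist):
    """Entropy in bits of {outcome: probability}; zero-probability terms are skipped."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def marginal(joint, axis):
    """Marginal over one coordinate of a joint {(x, y): probability}."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out


def conditional_entropy(joint):
    """H[X|Y] = H[X,Y] - H[Y], as in Eq. (7)."""
    return entropy(joint) - entropy(marginal(joint, 1))


def mutual_information(joint):
    """I[X;Y] = H[X] - H[X|Y], as in Eq. (8)."""
    return entropy(marginal(joint, 0)) - conditional_entropy(joint)


# Hypothetical joint distributions over pairs (x, y).
independent = {(x, y): 0.25 for x in "ab" for y in "cd"}
copy = {("a", "a"): 0.5, ("b", "b"): 0.5}  # Y determines X exactly
noisy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Swapping the roles of X and Y should leave the mutual information unchanged.
noisy_swapped = {(y, x): p for (x, y), p in noisy.items()}

for name, joint in [("independent", independent), ("copy", copy), ("noisy", noisy)]:
    print(name, conditional_entropy(joint), mutual_information(joint))
```

Independent variables share no information, a perfect copy shares a full bit, and the noisy case falls in between; the swapped table gives the same mutual information, illustrating the symmetry noted above.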



D. Patterns in Ensembles 



2 This is not necessarily true if n is sufficiently patholog- 
ical. To paraphrase Ref. Jsit , readers should assume that 
all our functions are sufficiently tame, measure-theoretically, 
that whatever induced distributions we invoke will exist. 



It will be convenient to have a way of talking about the uncertainty of the future. Intuitively, this would just be $H[\overrightarrow{S}]$, but in general that quantity is infinite and awkward to manipulate. (The special case in which $H[\overrightarrow{S}]$ is finite is dealt with in App. F.) Normally, we evade this by considering $H[\overrightarrow{S}^L]$, the uncertainty of the next $L$ symbols, treated as a function of $L$. On occasion, we will refer to the entropy per symbol or entropy rate [55,62]:

$$h[\overrightarrow{S}] = \lim_{L \to \infty} \frac{1}{L} H[\overrightarrow{S}^L] , \tag{9}$$

and the conditional entropy rate,

$$h[\overrightarrow{S} | X] = \lim_{L \to \infty} \frac{1}{L} H[\overrightarrow{S}^L | X] , \tag{10}$$

where $X$ is some random variable and the limits exist. For stationary stochastic processes, the limits always exist [62, Theorem 4.2.1, p. 64].

These entropy rates are also always bounded above by $H[S]$, which is a special case of Eq. (A3). Moreover, if $h[\overrightarrow{S}] = H[S]$, the process consists of independent variables: independent, identically distributed (IID) variables, in fact, since we are only concerned with stationary processes here.

Definition 3 (Capturing a Pattern) $\mathcal{R}$ captures a pattern if and only if there exists an $L$ such that

$$H[\overrightarrow{S}^L | \mathcal{R}] < L H[S] . \tag{11}$$

This says that $\mathcal{R}$ captures a pattern when it tells us something about how the distinguishable parts of a process affect each other: $\mathcal{R}$ exhibits their dependence. (We also speak of $\eta$, the function associated with pasts, as capturing a pattern, since this is implied by $\mathcal{R}$ capturing a pattern.) Supposing that these parts do not affect each other, then we have IID random variables, which is as close to the intuitive notion of "patternless" as one is likely to state mathematically. Note that, because of the independence bound on joint entropies (Eq. (A3)), if the inequality is satisfied for some $L$, it is also satisfied for every $L' > L$. Thus, we can consider the difference $H[S] - H[\overrightarrow{S}^L | \mathcal{R}]/L$, for the smallest $L$ for which it is nonzero, as the strength of the pattern captured by $\mathcal{R}$. We will now mark an upper bound (Lemma 1) on the strength of patterns; later we will show how to attain this upper bound (Thm. 1).

E. The Lessons of History

We are now in a position to prove a result about patterns in ensembles that will be useful in connection with our later theorems about causal states.

Lemma 1 (Old Country Lemma) For all $\mathcal{R}$ and for all $L \in \mathbb{Z}^+$,

$$H[\overrightarrow{S}^L | \mathcal{R}] \geq H[\overrightarrow{S}^L | \overleftarrow{S}] . \tag{12}$$

Proof. By construction (Eq. (4)), for all $L$,

$$H[\overrightarrow{S}^L | \mathcal{R}] = H[\overrightarrow{S}^L | \eta(\overleftarrow{S})] . \tag{13}$$

But

$$H[\overrightarrow{S}^L | \eta(\overleftarrow{S})] \geq H[\overrightarrow{S}^L | \overleftarrow{S}] , \tag{14}$$

since the entropy conditioned on a variable is never more than the entropy conditioned on a function of the variable (Eq. (A11)). QED.

Remark 1. That is, conditioning on the whole of the past reduces the uncertainty in the future to as small a value as possible. Carrying around the whole semi-infinite past is rather bulky and uncomfortable and is a somewhat dismaying prospect. Put a bit differently: we want to forget as much of the past as possible and so reduce its burden. It is the contrast between this desire and the result of Eq. (12) that leads us to call this the Old Country Lemma.

Remark 2. Lemma 1 establishes the promised upper bound on the strength of patterns: viz., the strength of the pattern is at most $H[S] - H[\overrightarrow{S}^{L_{\mathrm{past}}} | \overleftarrow{S}]/L_{\mathrm{past}}$, where $L_{\mathrm{past}}$ is the least value of $L$ such that $H[\overrightarrow{S}^L | \overleftarrow{S}] < L H[S]$.
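As a numerical illustration of the block entropies $H[\overrightarrow{S}^L]$ and the entropy-rate estimates $H[\overrightarrow{S}^L]/L$ of Eq. (9), the following minimal sketch enumerates words exactly for a small example process. The process (a binary Markov chain in which a 1 is always followed by a 0, often called the "Golden Mean" process) and all function names are assumptions chosen for illustration; they do not come from the text.

```python
import itertools
import math

# Hypothetical example process (an assumption, not from the text): a stationary
# binary Markov chain in which the symbol 1 is always followed by 0.
TRANS = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}  # P(next symbol | last symbol)
STATIONARY = {0: 2.0 / 3.0, 1: 1.0 / 3.0}            # stationary symbol distribution


def word_probability(word):
    """Probability of observing `word` as consecutive symbols."""
    p = STATIONARY[word[0]]
    for prev, nxt in zip(word, word[1:]):
        p *= TRANS[prev][nxt]
    return p


def block_entropy(L):
    """H[S^L] in bits, by exact enumeration of all 2**L binary words."""
    h = 0.0
    for word in itertools.product((0, 1), repeat=L):
        p = word_probability(word)
        if p > 0.0:
            h -= p * math.log2(p)
    return h


def entropy_rate_estimates(max_L):
    """Finite-L estimates H[S^L]/L of the entropy rate, for L = 1..max_L."""
    return [block_entropy(L) / L for L in range(1, max_L + 1)]
```

For this example the estimates decrease with $L$ and stay below the single-symbol entropy $H[S]$, consistent with the bound stated above; because the chain has memory of only one symbol, the increments $H[\overrightarrow{S}^{L}] - H[\overrightarrow{S}^{L-1}]$ are constant for $L \geq 2$.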



F. Minimality and Prediction 

Let's invoke Occam's Razor: "It is vain to do with 
more what can be done with less" |i|. To use the razor, 
we need to fix what is to be "done" and what "more" and "less" mean. The job we want done is accurate prediction, i.e., reducing the conditional entropies $H[\overrightarrow{S}^L | \mathcal{R}]$ as far as possible, the goal being to attain the bound set by Lemma 1. But we want to do this as simply as possible, with as few resources as possible. On the road to meeting these two constraints, minimal uncertainty and minimal resources, we will need a measure of the second. Since $P(\overleftarrow{S} = \overleftarrow{s})$ is well defined, there is an induced measure on the $\eta$-states; i.e., $P(\mathcal{R} = \rho)$, the probability of being in any particular effective state, is well defined. Accordingly, we define the following measure of resources.

Definition 4 (Complexity of State Classes) The statistical complexity of a class $\boldsymbol{\mathcal{R}}$ of states is

$$\begin{aligned} C_\mu(\mathcal{R}) &= H[\mathcal{R}] \\ &= -\sum_{\rho \in \boldsymbol{\mathcal{R}}} P(\mathcal{R} = \rho) \log_2 P(\mathcal{R} = \rho) , \end{aligned} \tag{15}$$

when the sum converges to a finite value.

The $\mu$ in $C_\mu$ reminds us that it is a measure-theoretic property and depends ultimately on the distribution over the process's sequences, which induces a measure over states.

The statistical complexity of a state class is the average uncertainty (in bits) in the process's current state. This, in turn, is the same as the average amount of memory (in bits) that the process appears to retain about the past, given the chosen state class $\boldsymbol{\mathcal{R}}$. (We will later, in Def. 12, see how to define the statistical complexity of a process itself.) The goal is to do with as little of this memory as possible. Restated then, we want to minimize statistical complexity, subject to the constraint of maximally accurate prediction.

The idea behind calling the collection of all partitions of $\overleftarrow{\mathbf{S}}$ Occam's pool should now be clear: One wants to find the shallowest point in the pool. This we now do.
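The statistical complexity of Eq. (15) is simply the Shannon entropy of the distribution over effective states. A minimal sketch, assuming the state probabilities $P(\mathcal{R} = \rho)$ are already known and supplied as a dictionary; the state names and example distributions below are hypothetical.

```python
import math


def statistical_complexity(state_probs):
    """C_mu(R) = H[R] in bits, given a dict mapping each state rho to P(R = rho)."""
    total = sum(state_probs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    # States of probability zero contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in state_probs.values() if p > 0.0)


# Hypothetical state classes, for illustration only.
one_state = {"rho0": 1.0}                               # no memory retained
two_states = {"rhoA": 2.0 / 3.0, "rhoB": 1.0 / 3.0}
four_states = {"r1": 0.25, "r2": 0.25, "r3": 0.25, "r4": 0.25}
```

A single state stores no information about the past, while four equiprobable states store two bits; the "shallowest point in the pool" is the state class with the smallest such value among those that predict equally well.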



IV. COMPUTATIONAL MECHANICS 

Those who are good at archery learnt from 
the bow and not from Yi the Archer. Those 
who know how to manage boats learnt from 
the boats and not from Wo. 

Anonymous, in Ref. [64].

The ultimate goal of computational mechanics is to 
discern the patterns intrinsic to a process. That is, as 
much as possible, the goal is to let the process describe 
itself, on its own terms, without appealing to a priori 
assumptions about the process's structure. Here we sim- 
ply explore the consistency and well-definedness of these 
goals. Of course, practical constraints may keep us from 
doing more than approximating these ideals more or less 
grossly. Naturally, such problems, which always turn up 
in implementation, are much easier to address if we start 
from secure foundations. 



A. Causal States 

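Definition 5 (A Process's Causal States) The causal states of a process are the members of the range of the function $\epsilon : \overleftarrow{\mathbf{S}} \mapsto 2^{\overleftarrow{\mathbf{S}}}$, where $2^{\overleftarrow{\mathbf{S}}}$ is the power set of $\overleftarrow{\mathbf{S}}$:

$$\epsilon(\overleftarrow{s}) = \{ \overleftarrow{s}' \,|\, P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}) = P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}') , \text{ for all } \overrightarrow{s} \in \overrightarrow{\mathbf{S}} , \overleftarrow{s}' \in \overleftarrow{\mathbf{S}} \} , \tag{16}$$

that maps from histories to sets of histories. We write the $i$th causal state as $\mathcal{S}_i$ and the set of all causal states as $\boldsymbol{\mathcal{S}}$; the corresponding random variable is denoted $\mathcal{S}$, and its realization $\sigma$.

The cardinality of $\boldsymbol{\mathcal{S}}$ is unspecified. $\boldsymbol{\mathcal{S}}$ can be finite, countably infinite, a continuum, a Cantor set, or something stranger still. Examples of these are given in Refs. H and [10]; see especially the examples for hidden Markov models given there.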



Alternately and equivalently, we could define an equivalence relation $\sim_\epsilon$ such that two histories are equivalent if and only if they have the same conditional distribution of futures, and then define causal states as the equivalence classes generated by $\sim_\epsilon$. (In fact, this was the original approach ||.) Either way, the divisions of this partition of $\overleftarrow{\mathbf{S}}$ are made between regions that leave us in different conditions of ignorance about the future.

This last statement suggests another, still equivalent, description of $\epsilon$:

$$\epsilon(\overleftarrow{s}) = \{ \overleftarrow{s}' \,|\, P(\overrightarrow{S}^L = \overrightarrow{s}^L | \overleftarrow{S} = \overleftarrow{s}) = P(\overrightarrow{S}^L = \overrightarrow{s}^L | \overleftarrow{S} = \overleftarrow{s}') , \ \overrightarrow{s}^L \in \overrightarrow{\mathbf{S}}^L , \ \overleftarrow{s}' \in \overleftarrow{\mathbf{S}} , \ L \in \mathbb{Z}^+ \} . \tag{17}$$

Using this we can make the original definition, Eq. (16), more intuitive by picturing a sequence of partitions of the space $\overleftarrow{\mathbf{S}}$ of all histories in which each new partition, induced using $L + 1$, is a refinement of the previous one induced using $L$. At the coarsest level, the first partition ($L = 1$) groups together those histories that have the same distribution for the very next observable. These classes are then subdivided using the distribution of the next two observables, then the next three, four, and so on. The limit of this sequence of partitions (the point at which every member of each class has the same distribution of futures, of whatever length, as every other member of that class) is the partition of $\overleftarrow{\mathbf{S}}$ induced by $\sim_\epsilon$. See App. B for a detailed discussion and review of the equivalence relation $\sim_\epsilon$.
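The partitioning described by Eq. (17) can be sketched for a toy case. The code below groups finite-length histories by their conditional distribution over the next $L$ symbols. It is an illustration under stated assumptions only: the example process (a binary Markov chain in which 1 is always followed by 0), the truncation of histories to length `K`, and the rounding used to compare distributions are all choices made here, not part of the text, and this is not the reconstruction procedure of any cited reference.

```python
import itertools
from collections import defaultdict

# Hypothetical example process (an assumption): 1 is always followed by 0.
TRANS = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}  # P(next symbol | last symbol)
STATIONARY = {0: 2.0 / 3.0, 1: 1.0 / 3.0}


def word_probability(word):
    """Probability of observing `word` as consecutive symbols."""
    p = STATIONARY[word[0]]
    for prev, nxt in zip(word, word[1:]):
        p *= TRANS[prev][nxt]
    return p


def future_distribution(history, L):
    """P(next L symbols | history), as a tuple indexed by all words of length L."""
    p_hist = word_probability(history)
    return tuple(
        round(word_probability(history + future) / p_hist, 12)
        for future in itertools.product((0, 1), repeat=L)
    )


def partition_histories(K, L):
    """Group allowed length-K histories whose length-L future distributions agree."""
    classes = defaultdict(list)
    for history in itertools.product((0, 1), repeat=K):
        if word_probability(history) > 0.0:
            classes[future_distribution(history, L)].append(history)
    return sorted(classes.values())
```

For this process the partition already stabilizes at $L = 1$: histories fall into two classes according to their final symbol, and increasing $L$ does not refine them further. Processes with longer memory would need larger `K` and `L` before the classes stop splitting.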

Although they will not be of direct concern in the following, due to the time-asymptotic limits taken, there are transient causal states in addition to those (recurrent) causal states defined above in Eq. (16). Roughly speaking, the transient causal states describe how a lengthening sequence (a history) of observations allows us to identify the recurrent causal states with increasing precision. See the developments in App. B and in Refs. (!(]] and [35] for more detail on transient causal states.

Causal states are a particular kind of effective state, and they have all the properties common to effective states (Sec. III B). In particular, each causal state $\mathcal{S}_i$ has several structures attached:

1. The index $i$: the state's "name".

2. The set of histories that have brought the process to $\mathcal{S}_i$, which we denote $\{ \overleftarrow{s} \in \mathcal{S}_i \}$.

3. A conditional distribution over futures, denoted $P(\overrightarrow{S} | \mathcal{S}_i)$, and equal to $P(\overrightarrow{S} | \overleftarrow{s})$, $\overleftarrow{s} \in \mathcal{S}_i$. Since we refer to this type of distribution frequently and since it is the "shape of the future", we call it the state's morph.

Ideally, each of these should be denoted by a different symbol, and there should be distinct functions linking each of these structures to their causal state. To keep the growth of notation under control, however, we shall be strategically vague about these distinctions. Readers may variously picture $\epsilon$ as mapping histories to (i) simple indices, (ii) subsets of histories, or (iii) ordered triples of indices, subsets, and morphs; or one may even leave $\epsilon$ uninterpreted, as preferred, without interfering with the development that follows.

FIG. 2. A schematic representation of the partitioning of the set $\overleftarrow{\mathbf{S}}$ of all histories into causal states $\mathcal{S}_i \in \boldsymbol{\mathcal{S}}$. Within each causal state all the individual histories $\overleftarrow{s}$ have the same morph: the same conditional distribution $P(\overrightarrow{S} | \overleftarrow{s})$ for future observables.



1. Morphs 

Each causal state has a unique morph, i.e., no two causal states have the same conditional distribution of futures. This follows directly from Def. 5 and it is not true of effective states in general. Another immediate consequence of that definition is that

$$P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \epsilon(\overleftarrow{s})) = P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}) . \tag{18}$$

(Again, this is not generally true of effective states.) This observation lets us prove a useful lemma about the conditional independence of the past $\overleftarrow{S}$ and the future $\overrightarrow{S}$.

Lemma 2 The past and the future are independent, conditioning on the causal states.

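Proof. Recall that two random variables $X$ and $Z$ are conditionally independent if and only if there is a third variable $Y$ such that

$$P(X = x, Y = y, Z = z) = P(X = x | Y = y) P(Z = z | Y = y) P(Y = y) . \tag{19}$$

That is, all of the dependence of $Z$ on $X$ is mediated by $Y$. For convenience below we note that, re-factoring the conditional probabilities, this is equivalent to the requirement that:

$$P(X = x, Y = y, Z = z) = P(Z = z | Y = y) P(Y = y | X = x) P(X = x) . \tag{20}$$

Let us consider $P(\overrightarrow{S} = \overrightarrow{s}, \mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s})$.

$$\begin{aligned} P(\overrightarrow{S} = \overrightarrow{s}, \mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s}) &= P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s}) P(\mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s}) \\ &= P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s}) P(\mathcal{S} = \sigma | \overleftarrow{S} = \overleftarrow{s}) P(\overleftarrow{S} = \overleftarrow{s}) . \end{aligned} \tag{21}$$

Now, $P(\mathcal{S} = \sigma | \overleftarrow{S} = \overleftarrow{s}) = 0$, unless $\sigma = \epsilon(\overleftarrow{s})$, in which case $P(\mathcal{S} = \sigma | \overleftarrow{S} = \overleftarrow{s}) = 1$. Either way, the first two factors in the last line of Eq. (21) can be written, by Eq. (18),

$$P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s}) P(\mathcal{S} = \sigma | \overleftarrow{S} = \overleftarrow{s}) = P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \sigma) P(\mathcal{S} = \sigma | \overleftarrow{S} = \overleftarrow{s}) , \tag{22}$$

so that, substituting Eq. (22) into Eq. (21),

$$P(\overrightarrow{S} = \overrightarrow{s}, \mathcal{S} = \sigma, \overleftarrow{S} = \overleftarrow{s}) = P(\overrightarrow{S} = \overrightarrow{s} | \mathcal{S} = \sigma) P(\mathcal{S} = \sigma | \overleftarrow{S} = \overleftarrow{s}) P(\overleftarrow{S} = \overleftarrow{s}) . \tag{23}$$

This has the form of Eq. (20), with the causal state playing the role of the mediating variable $Y$. QED.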

2. Homogeneity 

Following Ref. |^8|, we introduce two new definitions and a lemma which are required later on, especially in the proof of Lemma 7 and the theorems depending on that lemma.

Definition 6 (Strict Homogeneity) A set $X$ is strictly homogeneous with respect to a certain random variable $Y$ when the conditional distribution $P(Y | X)$ for $Y$ is the same for all subsets of $X$.

Definition 7 (Weak Homogeneity) A set $X$ is weakly homogeneous with respect to $Y$ if $X$ is not strictly homogeneous with respect to $Y$, but $X \setminus X_0$ ($X$ with $X_0$ removed) is, where $X_0$ is a subset of $X$ of measure 0.

Lemma 3 (Strict Homogeneity of Causal States) A process's causal states are the largest subsets of histories that are all strictly homogeneous with respect to futures of all lengths.

Proof. We must show that, first, the causal states are strictly homogeneous with respect to futures of all lengths and, second, that no larger strictly homogeneous subsets of histories could be made. The first point, the strict homogeneity of the causal states, is evident from Eq. (17): By construction, all elements of a causal state have the same morph, so any part of a causal state will have the same morph as the whole state. The second point likewise follows from Eq. (17), since the causal state by construction contains all the histories with a given morph. Any other set strictly homogeneous with respect to futures must be smaller than a causal state, and any set that includes a causal state as a proper subset cannot be strictly homogeneous. QED.

Remark. The statistical explanation literature would say that causal states are the "statistical-relevance basis for causal explanations". The elements of such a basis are, precisely, the largest classes of combinations of independent variables with homogeneous distributions for the dependent variables. See Ref. |5j| for further discussion along these lines.



B. Causal State-to-State Transitions 



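The causal state at any given time and the next value of the observed process together determine a new causal state; this is proved shortly in Lemma 5. Thus, there is a natural relation of succession among the causal states; recall the discussion of causality in Sec. II E. Moreover, given the current causal state, all the possible next values have well defined conditional probabilities. In fact, by construction the entire semi-infinite future does. Thus, there is a well defined probability $T_{ij}^{(s)}$ of the process generating the value $s \in \mathcal{A}$ and going to causal state $\mathcal{S}_j$, if it is in state $\mathcal{S}_i$.

Definition 8 (Causal Transitions) The labeled transition probability $T_{ij}^{(s)}$ is the probability of making the transition from state $\mathcal{S}_i$ to state $\mathcal{S}_j$ while emitting the symbol $s \in \mathcal{A}$:

$$T_{ij}^{(s)} = P(\mathcal{S}' = \mathcal{S}_j, \overrightarrow{S}^1 = s | \mathcal{S} = \mathcal{S}_i) , \tag{24}$$

where $\mathcal{S}$ is the current causal state and $\mathcal{S}'$ its successor on emitting $s$. We denote the set $\{ T_{ij}^{(s)} : s \in \mathcal{A} \}$ by $\mathbf{T}$.

Lemma 4 (Transition Probabilities) $T_{ij}^{(s)}$ is given by

$$\begin{aligned} T_{ij}^{(s)} &= P(\overleftarrow{s} s \in \mathcal{S}_j | \overleftarrow{s} \in \mathcal{S}_i) && (25) \\ &= \frac{P(\overleftarrow{s} \in \mathcal{S}_i, \overleftarrow{s} s \in \mathcal{S}_j)}{P(\overleftarrow{s} \in \mathcal{S}_i)} , && (26) \end{aligned}$$

where $\overleftarrow{s} s$ is read as the semi-infinite sequence obtained by concatenating $s \in \mathcal{A}$ onto the end of $\overleftarrow{s}$.

Proof.

$$\begin{aligned} T_{ij}^{(s)} &= P(\mathcal{S}' = \mathcal{S}_j, \overrightarrow{S}^1 = s | \mathcal{S} = \mathcal{S}_i) && (27) \\ &= \frac{P(\mathcal{S}' = \mathcal{S}_j, \overrightarrow{S}^1 = s, \mathcal{S} = \mathcal{S}_i)}{P(\mathcal{S} = \mathcal{S}_i)} . && (28) \end{aligned}$$

Now $\mathcal{S} = \mathcal{S}_i$ if and only if $\overleftarrow{s} \in \mathcal{S}_i$, and $\mathcal{S}' = \mathcal{S}_j$ if and only if $\overleftarrow{s}' \in \mathcal{S}_j$, where by $\overleftarrow{s}'$ we mean the history that is the immediate successor to $\overleftarrow{s}$; for consistency, $\overleftarrow{s}' = \overleftarrow{s} s$. So we can rewrite Eq. (28) as

$$\begin{aligned} T_{ij}^{(s)} &= \frac{P(\overleftarrow{s} \in \mathcal{S}_i, \overrightarrow{S}^1 = s, \overleftarrow{s}' \in \mathcal{S}_j)}{P(\overleftarrow{s} \in \mathcal{S}_i)} && (29) \\ &= \frac{P(\overleftarrow{s} \in \mathcal{S}_i, \overrightarrow{S}^1 = s, \overleftarrow{s} s \in \mathcal{S}_j)}{P(\overleftarrow{s} \in \mathcal{S}_i)} && (30) \\ &= \frac{P(\overleftarrow{s} \in \mathcal{S}_i, \overleftarrow{s} s \in \mathcal{S}_j)}{P(\overleftarrow{s} \in \mathcal{S}_i)} . && (31) \end{aligned}$$

In the third line we used the fact that $\overleftarrow{S} = \overleftarrow{s}$ and $\overleftarrow{S}' = \overleftarrow{s} s$ jointly imply $\overrightarrow{S}^1 = s$, making that condition redundant. QED.

Notice that $T_{ij}^{(\lambda)} = \delta_{ij}$; that is, the transition labeled by the null symbol $\lambda$ is the identity.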
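To make the labeled transition probabilities $T_{ij}^{(s)}$ concrete, here is a minimal sketch that stores them as one matrix per symbol and checks two elementary properties: each row, summed over symbols and successor states, is a probability distribution, and the null-symbol matrix is the identity. The two-state machine used (state 0 emits 0 or 1 with equal probability, state 1 always emits 0) is a hypothetical example, and the data layout `T[s][i][j]` is an assumption of this sketch.

```python
# Hypothetical two-state example (an assumption, not taken from the text).
# T[s][i][j] = P(next state = j, next symbol = s | current state = i).
T = {
    0: [[0.5, 0.0], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}


def symbol_probability(T, i, s):
    """P(next symbol = s | current state = i), summing over successor states."""
    return sum(T[s][i])


def is_normalized(T, tol=1e-12):
    """Check that, for every state i, the sum over s and j of T[s][i][j] is 1."""
    n = len(next(iter(T.values())))
    return all(
        abs(sum(T[s][i][j] for s in T for j in range(n)) - 1.0) <= tol
        for i in range(n)
    )


def null_symbol_matrix(n):
    """The transition matrix for the null symbol: the n-by-n identity."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
```

Storing one matrix per symbol mirrors the set $\mathbf{T}$ of Def. 8 directly; other layouts (for example, a dictionary keyed by state and symbol) would serve equally well.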



C. e-Machines 



The combination of the function $\epsilon$ from histories to causal states with the labeled transition probabilities $T_{ij}^{(s)}$ is called the ε-machine of the process.

Definition 9 (An ε-Machine Defined) The ε-machine of a process is the ordered pair $\{ \epsilon, \mathbf{T} \}$, where $\epsilon$ is the causal state function and $\mathbf{T}$ is the set of the transition matrices for the states defined by $\epsilon$.

Equivalently, we may denote an ε-machine by $\{ \boldsymbol{\mathcal{S}}, \mathbf{T} \}$.

To satisfy the algebraic requirement outlined in Sec. II F, we make explicit the connection with semi-group theory.

Proposition 1 (ε-Machines Are Monoids) The algebra generated by the ε-machine $\{ \epsilon, \mathbf{T} \}$ is a semi-group with an identity element, i.e., it is a monoid.

Proof. See App.

Remark. Due to this, ε-machines can be interpreted as capturing a process's generalized symmetries. Any subgroups of an ε-machine's semi-group are, in fact, symmetries in the more familiar sense.

Lemma 5 (ε-Machines Are Deterministic) For each $\mathcal{S}_i$ and $s \in \mathcal{A}$, $T_{ij}^{(s)} > 0$ only for that $\mathcal{S}_j$ for which $\epsilon(\overleftarrow{s} s) = \mathcal{S}_j$ if and only if $\epsilon(\overleftarrow{s}) = \mathcal{S}_i$, for all pasts $\overleftarrow{s}$.

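Proof. The lemma is equivalent to asserting that for all $s \in \mathcal{A}$ and $\overleftarrow{s}, \overleftarrow{s}' \in \overleftarrow{\mathbf{S}}$, if $\epsilon(\overleftarrow{s}) = \epsilon(\overleftarrow{s}')$, then $\epsilon(\overleftarrow{s} s) = \epsilon(\overleftarrow{s}' s)$. ($\overleftarrow{s} s$ is just another history and belongs to one or another causal state.)

Suppose this were not true. Then there would have to exist at least one future $\overrightarrow{s}$ such that

$$P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s} s) \neq P(\overrightarrow{S} = \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}' s) \tag{32}$$

when nonetheless $\epsilon(\overleftarrow{s}) = \epsilon(\overleftarrow{s}')$. Equivalently, we would have

$$\frac{P(\overleftrightarrow{S} = \overleftarrow{s} s \overrightarrow{s})}{P(\overleftarrow{S} = \overleftarrow{s} s)} \neq \frac{P(\overleftrightarrow{S} = \overleftarrow{s}' s \overrightarrow{s})}{P(\overleftarrow{S} = \overleftarrow{s}' s)} , \tag{33}$$

where we read $\overleftarrow{s} s \overrightarrow{s}$ as the semi-infinite string that begins $\overleftarrow{s} s$ and continues $\overrightarrow{s}$. (Remember, the point at which we break the stochastic process into a past and a future is arbitrary.) However, the probabilities in the denominators are equal to $P(\overrightarrow{S}^1 = s | \overleftarrow{S} = \overleftarrow{s}) P(\overleftarrow{S} = \overleftarrow{s})$ and $P(\overrightarrow{S}^1 = s | \overleftarrow{S} = \overleftarrow{s}') P(\overleftarrow{S} = \overleftarrow{s}')$, respectively, and by assumption $P(\overrightarrow{S}^1 = s | \overleftarrow{S} = \overleftarrow{s}') = P(\overrightarrow{S}^1 = s | \overleftarrow{S} = \overleftarrow{s})$, since $\epsilon(\overleftarrow{s}) = \epsilon(\overleftarrow{s}')$. Therefore, we would need

$$\frac{P(\overleftrightarrow{S} = \overleftarrow{s} s \overrightarrow{s})}{P(\overleftarrow{S} = \overleftarrow{s})} \neq \frac{P(\overleftrightarrow{S} = \overleftarrow{s}' s \overrightarrow{s})}{P(\overleftarrow{S} = \overleftarrow{s}')} . \tag{34}$$

This is the same, though, as

$$P(\overrightarrow{S} = s \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}) \neq P(\overrightarrow{S} = s \overrightarrow{s} | \overleftarrow{S} = \overleftarrow{s}') . \tag{35}$$

This is to say that there is a future $s \overrightarrow{s}$ that has different probabilities depending on whether we conditioned on $\overleftarrow{s}$ or on $\overleftarrow{s}'$. But this contradicts the assumption that the two histories belong to the same causal state. Therefore, there is no such future $\overrightarrow{s}$, and the alternative statement of the lemma is true. QED.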

Remark 1. In automata theory p6|, a set of states and transitions is said to be deterministic if the current state and the next input (here, the next result from the original stochastic process) together fix the next state. This use of the word "deterministic" is often confusing, since many stochastic processes (e.g., simple Markov chains) are deterministic in this sense.

Remark 2. Starting from a fixed state, a given symbol always leads to at most one single state. But there can be several transitions from one state to another, each labeled with a different symbol.

Remark 3. Clearly, if $T_{ij}^{(s)} > 0$, then $T_{ij}^{(s)} = P(\overrightarrow{S}^1 = s | \mathcal{S} = \mathcal{S}_i)$. In automata theory the "disallowed" transitions ($T_{ij}^{(s)} = 0$) are sometimes explicitly represented and lead to a "reject" state indicating that the particular history does not occur.
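The determinism property of Lemma 5 (current state plus next symbol fix the next state) is easy to test on a set of labeled transition matrices. The sketch below assumes the hypothetical layout `T[s][i][j]` for $T_{ij}^{(s)}$; both example machines are invented for illustration, and a missing transition is reported as `None` rather than routed to an explicit "reject" state.

```python
# T[s][i][j] = P(next state = j, next symbol = s | current state = i).
# Both machines below are hypothetical examples, not taken from the text.
T_DETERMINISTIC = {
    0: [[0.5, 0.0], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}
T_NONDETERMINISTIC = {
    0: [[0.25, 0.25], [1.0, 0.0]],  # from state 0, symbol 0 can lead to 0 or 1
    1: [[0.0, 0.5], [0.0, 0.0]],
}


def is_deterministic(T):
    """True if every (state, symbol) pair has at most one possible successor."""
    for matrix in T.values():
        for row in matrix:
            if sum(1 for p in row if p > 0.0) > 1:
                return False
    return True


def successor(T, i, s):
    """The unique successor of state i on symbol s, or None if s is disallowed."""
    targets = [j for j, p in enumerate(T[s][i]) if p > 0.0]
    if len(targets) > 1:
        raise ValueError("transition structure is not deterministic")
    return targets[0] if targets else None
```

Note that several symbols may still connect the same pair of states, as Remark 2 points out; the check only forbids one symbol from leading to two different successors.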



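Lemma 6 (Causal States Are Independent) The probability distributions over causal states at different times are conditionally independent.

Proof. What we wish to show is that, writing $\mathcal{S}$, $\mathcal{S}'$, $\mathcal{S}''$ for the sequence of causal states at three successive times, $\mathcal{S}$ and $\mathcal{S}''$ are conditionally independent, given $\mathcal{S}'$. We can do this directly:

$$\begin{aligned} P(\mathcal{S} = \sigma, \mathcal{S}' = \sigma', \mathcal{S}'' = \sigma'') &= P(\mathcal{S}'' = \sigma'' | \mathcal{S} = \sigma, \mathcal{S}' = \sigma') P(\mathcal{S} = \sigma, \mathcal{S}' = \sigma') \\ &= P(\overrightarrow{S}^1 \in a | \mathcal{S} = \sigma, \mathcal{S}' = \sigma') P(\mathcal{S} = \sigma, \mathcal{S}' = \sigma') , \end{aligned} \tag{36}$$

where $a$ is the subset of all symbols that lead from $\sigma'$ to $\sigma''$. This is a well defined subset, in virtue of Lemma 5 immediately preceding, which also guarantees the equality of conditional probabilities we have used. Likewise,

$$P(\mathcal{S}'' = \sigma'' | \mathcal{S}' = \sigma') = P(\overrightarrow{S}^1 \in a | \mathcal{S}' = \sigma') . \tag{37}$$

But, by construction,

$$P(\overrightarrow{S}^1 \in a | \mathcal{S} = \sigma, \mathcal{S}' = \sigma') = P(\overrightarrow{S}^1 \in a | \mathcal{S}' = \sigma') , \tag{38}$$

and hence

$$P(\mathcal{S}'' = \sigma'' | \mathcal{S}' = \sigma') = P(\mathcal{S}'' = \sigma'' | \mathcal{S} = \sigma, \mathcal{S}' = \sigma') . \tag{39}$$

So, to resume,

$$\begin{aligned} P(\mathcal{S} = \sigma, \mathcal{S}' = \sigma', \mathcal{S}'' = \sigma'') &= P(\mathcal{S}'' = \sigma'' | \mathcal{S}' = \sigma') P(\mathcal{S} = \sigma, \mathcal{S}' = \sigma') \\ &= P(\mathcal{S}'' = \sigma'' | \mathcal{S}' = \sigma') P(\mathcal{S}' = \sigma' | \mathcal{S} = \sigma) P(\mathcal{S} = \sigma) . \end{aligned} \tag{40}$$

The last line follows from the definition of conditional probability and is equivalent to the more easily interpreted expression given by

$$P(\mathcal{S}'' | \mathcal{S}') P(\mathcal{S} | \mathcal{S}') P(\mathcal{S}') . \tag{41}$$

Thus, applying mathematical induction to Eq. (41), causal states at different times are independent, conditioning on the intermediate causal states. QED.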

Remark 1. This lemma strengthens the claim that the causal states are, in fact, the causally efficacious states: given knowledge of the present state, what has gone before makes no difference. (Again, recall the philosophical preliminaries of Sec. II E.)

Remark 2. This result indicates that the causal states, considered as a process, define a kind of Markov chain. Thus, causal states can be roughly considered to be a generalization of Markovian states. We say "kind of" since the class of ε-machines is substantially richer [10] than what one normally associates with Markov chains.
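Summing the labeled transition probabilities over symbols gives the state-to-state chain referred to in Remark 2, and its stationary distribution yields the state probabilities that enter the statistical complexity. A minimal sketch, assuming a hypothetical two-state machine, the layout `T[s][i][j]`, and simple repeated multiplication (which converges here because the example chain is irreducible and aperiodic):

```python
import math

# Hypothetical example (an assumption, not from the text):
# T[s][i][j] = P(next state = j, next symbol = s | current state = i).
T = {
    0: [[0.5, 0.0], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}


def state_chain(T):
    """M[i][j] = sum over symbols s of T[s][i][j] = P(S' = j | S = i)."""
    n = len(next(iter(T.values())))
    return [[sum(T[s][i][j] for s in T) for j in range(n)] for i in range(n)]


def stationary_distribution(M, iterations=200):
    """Approximate pi with pi = pi M by iterating from the uniform distribution."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_bits(dist):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)
```

For this example the stationary state probabilities are 2/3 and 1/3, so the entropy of the state distribution is about 0.918 bits.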



Definition 10 (ε-Machine Reconstruction) ε-Machine reconstruction is any procedure that, given a process $P(\overleftrightarrow{S})$, or an approximation of $P(\overleftrightarrow{S})$, produces the process's ε-machine $\{ \boldsymbol{\mathcal{S}}, \mathbf{T} \}$.

Given a mathematical description of a process, one can often calculate analytically its ε-machine. (For example, see the computational mechanics analysis of spin systems in Ref. [65].) There is also a wide range of algorithms which reconstruct ε-machines from empirical estimates of $P(\overleftrightarrow{S})$. Some, such as those used in Refs. [§-0j6^], operate in "batch" mode, taking the raw data as a whole and producing the ε-machine. Others could operate incrementally, in "on-line" mode, taking in individual measurements and re-estimating the set of causal states and their transition probabilities.



V. OPTIMALITIES AND UNIQUENESS 



We now show that: causal states are maximally accurate predictors of minimal statistical complexity; they are unique in sharing both properties; and their state-to-state transitions are minimally stochastic. In other words, they satisfy both of the constraints borrowed from Occam, and they are the only representations that do so. The overarching moral here is that causal states and ε-machines are the goals in any learning or modeling scheme. The argument is made by the time-honored means of proving optimality theorems. We address, in our concluding remarks (Sec. VII), the practicalities involved in attaining these goals.

FIG. 3. An alternative class $\boldsymbol{\mathcal{R}}$ of states (delineated by dashed lines) that partition $\overleftarrow{\mathbf{S}}$ overlaid on the causal states $\boldsymbol{\mathcal{S}}$ (outlined by solid lines). Here, for example, $\mathcal{S}_2$ contains parts of $\mathcal{R}_1$, $\mathcal{R}_2$, $\mathcal{R}_3$ and $\mathcal{R}_4$. The collection of all such alternative partitions forms Occam's pool. Note again that the $\mathcal{R}_i$ need not be compact nor simply connected, as drawn.

As part of our strategy, though, we also prove several results that are not optimality results; we call these lemmas to indicate their subordinate status. All of our theorems, and some of our lemmas, will be established by comparing causal states, generated by $\epsilon$, with other rival sets of states, generated by other functions $\eta$. In short, none of the rival states (none of the other patterns) can out-perform the causal states.

It is convenient to fix some additional notation. Let $\mathcal{S}$ be the random variable for the current causal state, $\overrightarrow{S}^1 \in \mathcal{A}$ the next "observable" we get from the original stochastic process, $\mathcal{S}'$ the next causal state, $\mathcal{R}$ the current state according to $\eta$, and $\mathcal{R}'$ the next $\eta$-state. $\sigma$ will stand for a particular value (causal state) of $\mathcal{S}$ and $\rho$ a particular value of $\mathcal{R}$. When we quantify over alternatives to the causal states, we quantify over $\boldsymbol{\mathcal{R}}$.

Theorem 1 (Causal States are Maximally Prescient) For all $\boldsymbol{\mathcal{R}}$ and all $L \in \mathbb{Z}^+$,

$$H[\overrightarrow{S}^L | \mathcal{R}] \geq H[\overrightarrow{S}^L | \mathcal{S}] . \tag{42}$$

Proof. We have already seen that $H[\overrightarrow{S}^L | \mathcal{R}] \geq H[\overrightarrow{S}^L | \overleftarrow{S}]$ (Lemma 1). But by construction (Def. 5),

$$P(\overrightarrow{S}^L = \overrightarrow{s}^L | \overleftarrow{S} = \overleftarrow{s}) = P(\overrightarrow{S}^L = \overrightarrow{s}^L | \mathcal{S} = \epsilon(\overleftarrow{s})) . \tag{43}$$

Since entropies depend only on the probability distribution, $H[\overrightarrow{S}^L | \mathcal{S}] = H[\overrightarrow{S}^L | \overleftarrow{S}]$ for every $L$. Thus, $H[\overrightarrow{S}^L | \mathcal{R}] \geq H[\overrightarrow{S}^L | \mathcal{S}]$, for all $L$. QED.

Remark. That is to say, causal states are as good at predicting the future (are as prescient) as complete histories. In this, they satisfy the first requirement borrowed from Occam. Since the causal states are well defined and since they can be systematically approximated, we have shown that the upper bound on the strength of patterns (Def. 3 and Lemma 1, Remark) can in fact be reached. Intuitively, the causal states achieve this because, unlike effective states in general, they do not throw away any information about the future which might be contained in $\overleftarrow{S}$. Even more colloquially, to paraphrase the definition of information in Ref. [70], the causal states record every difference (about the past) that makes a difference (to the future). We can actually make this intuition quite precise, in an easy corollary to the theorem.



Corollary 1 (Causal States Are Sufficient Statistics) The causal states $\boldsymbol{\mathcal{S}}$ of a process are sufficient statistics for predicting it.

Proof. It follows from Thm. 1 and Eq. (8) that, for all $L \in \mathbb{Z}^+$,

$$I[\overrightarrow{S}^L ; \mathcal{S}] = I[\overrightarrow{S}^L ; \overleftarrow{S}] , \tag{44}$$

where $I$ was defined in Eq. (8). Consequently, the causal state is a sufficient statistic (see Refs. [62, p. 37] and [71, Secs. 2.4-2.5]) for predicting futures of any length. QED.

All subsequent results concern rival states that are as prescient as the causal states. We call these prescient rivals and denote a class of them $\hat{\boldsymbol{\mathcal{R}}}$.

Definition 11 (Prescient Rivals) Prescient rivals $\hat{\boldsymbol{\mathcal{R}}}$ are states that are as predictive as the causal states; viz., for all $L \in \mathbb{Z}^+$,

$$H[\overrightarrow{S}^L | \hat{\mathcal{R}}] = H[\overrightarrow{S}^L | \mathcal{S}] . \tag{45}$$

Remark. Prescient rivals are also sufficient statistics.

Lemma 7 (Refinement Lemma) For all prescient rivals $\hat{\boldsymbol{\mathcal{R}}}$ and for each $\hat{\rho} \in \hat{\boldsymbol{\mathcal{R}}}$, there is a $\sigma \in \boldsymbol{\mathcal{S}}$ and a measure-0 subset $\hat{\rho}_0 \subset \hat{\rho}$, possibly empty, such that $\hat{\rho} \setminus \hat{\rho}_0 \subseteq \sigma$, where $\setminus$ is set subtraction.

Proof. We invoke a straightforward extension of Thm. 2.7.3 of Ref. [62]: If $X_1, X_2, \ldots, X_n$ are random variables over the same set $\mathcal{A}$, each with distinct probability distributions, $\Theta$ a random variable over the integers from 1 to $n$ such that $P(\Theta = i) = \lambda_i$, and $Z$ a random variable over $\mathcal{A}$ such that $Z = X_\Theta$, then

$$H[Z] = H\left[ \sum_{i=1}^{n} \lambda_i X_i \right] \geq \sum_{i=1}^{n} \lambda_i H[X_i] . \tag{46}$$

In words, the entropy of a mixture of distributions is at least the mean of the entropies of those distributions. This follows since $H$ is strictly concave, which in turn follows from $x \log x$ being strictly convex for $x > 0$. We obtain equality in Eq. (46) if and only if all the $\lambda_i$ are either 0 or 1, i.e., if and only if $Z$ is at least weakly homogeneous (Def. 7).

The conditional distribution of futures for each rival state $\rho$ can be written as a weighted mixture of the morphs of one or more causal states. (Cf. Fig. 3.) Thus, by Eq. (46), unless every $\rho$ is at least weakly homogeneous with respect to $\overrightarrow{S}^L$ (for each $L$), the entropy of $\overrightarrow{S}^L$ conditioned on $\mathcal{R}$ will be higher than the minimum, the entropy conditioned on $\mathcal{S}$. So, in the case of the maximally predictive $\hat{\boldsymbol{\mathcal{R}}}$, every $\hat{\rho} \in \hat{\boldsymbol{\mathcal{R}}}$ must be at least weakly homogeneous with respect to all $\overrightarrow{S}^L$. But the causal states are the largest classes that are strictly homogeneous with respect to all $\overrightarrow{S}^L$ (Lemma 3). Thus, the strictly homogeneous part of each $\hat{\rho} \in \hat{\boldsymbol{\mathcal{R}}}$ must be a subclass, possibly improper, of some causal state $\sigma \in \boldsymbol{\mathcal{S}}$. QED.
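The mixture inequality used in this proof can be checked numerically. The sketch below compares the entropy of a mixture of distributions with the weighted mean of their entropies; the two "morphs" and the weights are hypothetical values chosen for illustration (they correspond to merging two states with next-symbol distributions (1/2, 1/2) and (1, 0) into a single rival state).

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def mixture(weights, dists):
    """The mixture distribution: sum over i of weights[i] * dists[i]."""
    n = len(dists[0])
    return [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(n)]


def mean_entropy(weights, dists):
    """Weighted mean of the component entropies: sum_i weights[i] * H[X_i]."""
    return sum(w * entropy_bits(d) for w, d in zip(weights, dists))


# Hypothetical next-symbol morphs of two states and their relative weights.
morphs = [[0.5, 0.5], [1.0, 0.0]]
weights = [2.0 / 3.0, 1.0 / 3.0]
```

With these numbers the merged state is uncertain by about 0.918 bits about the next symbol, against a mean of about 0.667 bits when the two states are kept apart; with all the weight on a single component the two quantities coincide.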

Remark 1. An alternative proof appears in App. E.

Remark 2. The content of the lemma can be made quite intuitive, if we ignore for a moment the measure-0 set $\hat{\rho}_0$ of histories mentioned in its statement. It then asserts that any alternative partition $\hat{\boldsymbol{\mathcal{R}}}$ that is as prescient as the causal states must be a refinement of the causal-state partition. That is, each $\hat{\mathcal{R}}_i$ must be a (possibly improper) subset of some $\mathcal{S}_j$. Otherwise, at least one $\hat{\mathcal{R}}_i$ would have to contain parts of at least two causal states. And so, using this $\hat{\mathcal{R}}_i$ to predict the future observables would lead to more uncertainty about $\overrightarrow{S}$ than using the causal states. This is illustrated by Fig. 4, which should be contrasted with Fig. 3.

Adding the measure-0 set $\hat{\rho}_0$ of histories to this picture does not change its heuristic content much. Precisely because these histories have zero probability, treating them in an "inappropriate" way makes no discernible difference to predictions, morphs, and so on. There is a problem of terminology, however, since there seems to be no standard name for the relationship between the partitions $\hat{\boldsymbol{\mathcal{R}}}$ and $\boldsymbol{\mathcal{S}}$. We propose to say that the former is a refinement of the latter almost everywhere or, simply, a refinement a.e.

Remark 3. One cannot work the proof the other way around to show that the causal states have to be a refinement of the equally prescient $\hat{\mathcal{R}}$-states. This is precluded because applying the theorem borrowed from Ref. [62], Eq. (46), hinges on being able to reduce uncertainty by specifying from which distribution one chooses. Since the causal states are constructed so as to be strictly homogeneous with respect to futures, this is not the case. Lemma 3 and Thm. 1 together protect us.

Remark 4. Because almost all of each prescient rival state is wholly contained within a single causal state, we can construct a function $g : \hat{\boldsymbol{\mathcal{R}}} \mapsto \boldsymbol{\mathcal{S}}$, such that, if $\hat{\eta}(\overleftarrow{s}) = \hat{\rho}$, then $\epsilon(\overleftarrow{s}) = g(\hat{\rho})$ almost always. We can even say that $\mathcal{S} = g(\hat{\mathcal{R}})$ almost always, with the understanding that this means that, for each $\hat{\rho}$, $P(\mathcal{S} = \sigma | \hat{\mathcal{R}} = \hat{\rho}) > 0$ if and only if $\sigma = g(\hat{\rho})$.




FIG. 4. A prescient rival partition $\hat{\boldsymbol{\mathcal{R}}}$ must be a refinement of the causal-state partition almost everywhere. That is, almost all of each $\hat{\mathcal{R}}_i$ must be contained within some $\mathcal{S}_j$; the exceptions, if any, are a set of histories of measure 0. Here for instance $\mathcal{S}_2$ contains the positive-measure parts of $\hat{\mathcal{R}}_3$, $\hat{\mathcal{R}}_4$, and $\hat{\mathcal{R}}_5$. One of these rival states, say $\hat{\mathcal{R}}_3$, could have member-histories in any or all of the other causal states, provided the total measure of such exceptional histories is zero. Cf. Fig. 3.

Theorem 2 (Causal States Are Minimal) For all prescient rivals $\hat{\boldsymbol{\mathcal{R}}}$,

$$C_\mu(\hat{\mathcal{R}}) \geq C_\mu(\mathcal{S}) . \tag{47}$$

Proof. By Lemma 7, Remark 4, there is a function $g$ such that $\mathcal{S} = g(\hat{\mathcal{R}})$ almost always. But $H[f(X)] \leq H[X]$ (Eq. (A11)) and so

$$H[\mathcal{S}] = H[g(\hat{\mathcal{R}})] \leq H[\hat{\mathcal{R}}] , \tag{48}$$

but $C_\mu(\hat{\mathcal{R}}) = H[\hat{\mathcal{R}}]$ (Def. 4). QED.
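The inequality $H[g(X)] \leq H[X]$ behind this proof can be illustrated with a small numerical sketch. A hypothetical rival class splits one causal state into two sub-states; the function `g` maps each rival state back to the causal state that contains it. The state names and probabilities are assumptions made for the example.

```python
import math
from collections import defaultdict


def entropy_bits(dist):
    """Shannon entropy, in bits, of a dict mapping states to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)


def pushforward(dist, g):
    """Distribution of g(X): add up the probability of all states with the same image."""
    out = defaultdict(float)
    for state, p in dist.items():
        out[g[state]] += p
    return dict(out)


# Hypothetical prescient rival: causal state "A" is split into "A1" and "A2".
rival = {"A1": 1.0 / 3.0, "A2": 1.0 / 3.0, "B": 1.0 / 3.0}
g = {"A1": "A", "A2": "A", "B": "B"}
causal = pushforward(rival, g)
```

Here the rival class costs $\log_2 3 \approx 1.585$ bits of memory while the coarser class costs about 0.918 bits; the extra distinction between `A1` and `A2` buys no additional predictive power, because both sub-states share the morph of `A`.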

Remark 1. We have just established that no rival pattern, which is as good at predicting the observations as the causal states, is any simpler, in the sense given by Def. 4, than the causal states. (This is the theorem of Ref. ||.) Occam therefore tells us that there is no reason not to use the causal states. The next theorem shows that causal states are uniquely optimal, and so that Occam's Razor all but forces us to use them.

Remark 2. Here it becomes important that we are trying to predict the whole of $\overrightarrow{S}$ and not just some piece, $\overrightarrow{S}^L$. Suppose two histories $\overleftarrow{s}$ and $\overleftarrow{s}'$ have the same conditional distribution for futures of lengths up to $L$, but differing ones after that. They would then belong to different causal states. An $\mathcal{R}$-state that merged those two causal states, however, would have just as much ability to predict $\overrightarrow{S}^L$ as the causal states. More, these $\mathcal{R}$-states would be simpler, in the sense that the uncertainty in the current state would be lower. We conclude that causal states are optimal, but for the hardest job: that of predicting futures of all lengths.

Remark 3. We have already seen (Thm. 1, Remark 2) that causal states are sufficient statistics for predicting futures of all lengths; so are all prescient rivals. A minimal sufficient statistic is one that is a function of all other sufficient statistics [62, p. 38]. Since, in the course of the proof of Thm. 2, we have shown that there is a function $g$ from any $\hat{\boldsymbol{\mathcal{R}}}$ to $\boldsymbol{\mathcal{S}}$, we have also shown that causal states are minimal sufficient statistics.

We may now, as promised, define the statistical complexity of a process.

Definition 12 (Statistical Complexity of a Process) The statistical complexity "$C_\mu(\mathcal{O})$" of a process $\mathcal{O}$ is that of its causal states: $C_\mu(\mathcal{O}) = C_\mu(\mathcal{S})$.

Due to the minimality of causal states we see that the statistical complexity measures the average amount of historical memory stored in the process. Without the minimality theorem, this interpretation would not be possible, since we could trivially elaborate internal states, while still generating the same observed process. $C_\mu$ for those states would grow without bound and so be arbitrary and not a characteristic property of the process.


Theorem 3 (Causal States Are Unique) For all pre- 
scient rivals Tt, if C M (7V) = C M (<S) , then there exists an 
invertible function between TZ and S that almost always 
preserves equivalence of state: TZ. and 7/ are the same 
as S and e, respectively, except on a set of histories of 
measure 0. 

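Proof. From the Refinement Lemma we know that
$\mathcal{S} = g(\widehat{\mathcal{R}})$ almost always. We now show that there is a
function $f$ such that $\widehat{\mathcal{R}} = f(\mathcal{S})$ almost always, implying
that $g = f^{-1}$ and that $f$ is the desired relation between
the two sets of states. To do this, by Eq. (A12) it is
sufficient to show that $H[\widehat{\mathcal{R}} | \mathcal{S}] = 0$. Now, it follows from
an information-theoretic identity (Eq. (A8)) that

$$H[\mathcal{S}] - H[\mathcal{S} | \widehat{\mathcal{R}}] = H[\widehat{\mathcal{R}}] - H[\widehat{\mathcal{R}} | \mathcal{S}] . \tag{49}$$

Since, by the Refinement Lemma, $H[\mathcal{S} | \widehat{\mathcal{R}}] = 0$, both sides
of Eq. (49) are equal to $H[\mathcal{S}]$. But, by hypothesis,
$H[\widehat{\mathcal{R}}] = H[\mathcal{S}]$. Thus, $H[\widehat{\mathcal{R}} | \mathcal{S}] = 0$ and so there exists an
$f$ such that $\widehat{\mathcal{R}} = f(\mathcal{S})$ almost always. We have then that
$f(g(\widehat{\mathcal{R}})) = \widehat{\mathcal{R}}$ and $g(f(\mathcal{S})) = \mathcal{S}$, so $g = f^{-1}$. This
implies that $f$ preserves equivalence of states almost
always: for almost all $\overleftarrow{s}, \overleftarrow{s}' \in \overleftarrow{\mathbf{S}}$,
$\eta(\overleftarrow{s}) = \eta(\overleftarrow{s}')$ if and only if $\epsilon(\overleftarrow{s}) = \epsilon(\overleftarrow{s}')$.
QED.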

Remark. As in the case of the Refinement Lemma, on
which the theorem is based, the measure-0 caveats seem
unavoidable. A rival that is as predictive and as simple
(in the sense of statistical complexity) as the causal
states can assign a measure-0 set of histories to different
states than the ε-machine does, but no more. This makes
sense: such a measure-0 set makes no difference, since its
members are never observed, by definition. By the same
token, however, nothing prevents a minimal, prescient
rival from disagreeing with the ε-machine on those histories.

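Theorem 4 (ε-Machines Are Minimally Stochastic)
For all prescient rivals $\widehat{\mathcal{R}}$,

$$H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}] \geq H[\mathcal{S}' | \mathcal{S}] , \tag{50}$$

where $\mathcal{S}'$ and $\widehat{\mathcal{R}}'$ are the next causal state of the process
and the next $\eta$-state, respectively.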

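Proof. From the Determinism Lemma, $\mathcal{S}'$ is fixed by $\mathcal{S}$
and $\overrightarrow{S}^1$ together, thus $H[\mathcal{S}' | \mathcal{S}, \overrightarrow{S}^1] = 0$ by Eq. (A12).
Therefore, from the chain rule for entropies, Eq. (A6),

$$H[\overrightarrow{S}^1 | \mathcal{S}] = H[\mathcal{S}', \overrightarrow{S}^1 | \mathcal{S}] . \tag{51}$$

We have no result like the Determinism Lemma for the
rival states $\widehat{\mathcal{R}}$, but entropies are always non-negative:
$H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}, \overrightarrow{S}^1] \geq 0$. Since for all $L$,
$H[\overrightarrow{S}^L | \widehat{\mathcal{R}}] = H[\overrightarrow{S}^L | \mathcal{S}]$ by the definition of prescient
rivals (Def. 11), $H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}] = H[\overrightarrow{S}^1 | \mathcal{S}]$. Now we apply
the chain rule again,

$$H[\widehat{\mathcal{R}}', \overrightarrow{S}^1 | \widehat{\mathcal{R}}] = H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}] + H[\widehat{\mathcal{R}}' | \overrightarrow{S}^1, \widehat{\mathcal{R}}] \tag{52}$$
$$\geq H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}] \tag{53}$$
$$= H[\overrightarrow{S}^1 | \mathcal{S}] \tag{54}$$
$$= H[\mathcal{S}', \overrightarrow{S}^1 | \mathcal{S}] \tag{55}$$
$$= H[\mathcal{S}' | \mathcal{S}] + H[\overrightarrow{S}^1 | \mathcal{S}', \mathcal{S}] . \tag{56}$$

In going from Eq. (54) to Eq. (55) we have used Eq. (51),
and in the last step we have used the chain rule once
more.

Using the chain rule one last time, we have

$$H[\widehat{\mathcal{R}}', \overrightarrow{S}^1 | \widehat{\mathcal{R}}] = H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}] + H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}', \widehat{\mathcal{R}}] . \tag{57}$$

Putting these expansions, Eqs. (56) and (57), together
we get

$$H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}] + H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}', \widehat{\mathcal{R}}] \geq H[\mathcal{S}' | \mathcal{S}] + H[\overrightarrow{S}^1 | \mathcal{S}', \mathcal{S}] \tag{58}$$
$$H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}] - H[\mathcal{S}' | \mathcal{S}] \geq H[\overrightarrow{S}^1 | \mathcal{S}', \mathcal{S}] - H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}', \widehat{\mathcal{R}}] .$$

From the Refinement Lemma, we know that $\mathcal{S} = g(\widehat{\mathcal{R}})$, so
there is another function $g'$ from ordered pairs of
$\eta$-states to ordered pairs of causal states:
$(\mathcal{S}', \mathcal{S}) = g'(\widehat{\mathcal{R}}', \widehat{\mathcal{R}})$. Therefore, Eq. (A14) implies

$$H[\overrightarrow{S}^1 | \mathcal{S}', \mathcal{S}] \geq H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}', \widehat{\mathcal{R}}] .$$

And so, we have that

$$H[\overrightarrow{S}^1 | \mathcal{S}', \mathcal{S}] - H[\overrightarrow{S}^1 | \widehat{\mathcal{R}}', \widehat{\mathcal{R}}] \geq 0$$
$$H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}] - H[\mathcal{S}' | \mathcal{S}] \geq 0 \tag{59}$$
$$H[\widehat{\mathcal{R}}' | \widehat{\mathcal{R}}] \geq H[\mathcal{S}' | \mathcal{S}] . \tag{60}$$

QED.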

Remark. What this theorem says is that there is no 
more uncertainty in transitions between causal states, 
than there is in the transitions between any other kind 
of prescient effective states. In other words, the causal 
states approach as closely to perfect determinism — in the 
usual physical, non-computation-theoretic sense — as any 
rival that is as good at predicting the future. This sort of 
internal determinism has long been held to be a
desideratum of scientific models [72].



VI. BOUNDS 

In this section we develop bounds between measures 
of structural complexity and entropy derived from e- 
machines and those from ergodic and information the- 
ories, which are perhaps more familiar. 



Definition 13 (Excess Entropy) The excess entropy
$\mathbf{E}$ of a process is the mutual information between its
semi-infinite past and its semi-infinite future:

$$\mathbf{E} \equiv I[\overrightarrow{S}; \overleftarrow{S}] . \tag{61}$$



The excess entropy is a frequently used measure of the
complexity of stochastic processes and appears under a
variety of names; e.g., "predictive information", "stored
information", "effective measure complexity", and so on
[73–79]. $\mathbf{E}$ measures the amount of apparent information
stored in the observed behavior about the past. As we
now establish, $\mathbf{E}$ is not, in general, the amount of memory
that the process stores internally about its past, a
quantity measured by $C_\mu$.

Theorem 5 (The Bounds of Excess) The statistical
complexity bounds the excess entropy $\mathbf{E}$:

$$\mathbf{E} \leq C_\mu , \tag{62}$$

with equality if and only if $H[\mathcal{S} | \overrightarrow{S}] = 0$.



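Proof. $\mathbf{E} = I[\overrightarrow{S}; \overleftarrow{S}] = H[\overrightarrow{S}] - H[\overrightarrow{S} | \overleftarrow{S}]$ and, by the
construction of causal states, $H[\overrightarrow{S} | \overleftarrow{S}] = H[\overrightarrow{S} | \mathcal{S}]$, so

$$\mathbf{E} = H[\overrightarrow{S}] - H[\overrightarrow{S} | \mathcal{S}] = I[\overrightarrow{S}; \mathcal{S}] . \tag{63}$$

Thus, since the mutual information between two variables
is never larger than the self-information of either
one of them (Eq. (A9)), $\mathbf{E} \leq H[\mathcal{S}] = C_\mu$, with equality
if and only if $H[\mathcal{S} | \overrightarrow{S}] = 0$. QED.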

Remark 1. Note that we have invoked $H[\overrightarrow{S}]$, not
$H[\overrightarrow{S}^L]$, but only while subtracting off quantities like
$H[\overrightarrow{S} | \overleftarrow{S}]$. We need not worry, therefore, about the
existence of a finite $L \to \infty$ limit for $H[\overrightarrow{S}^L]$, just that of a
finite $L \to \infty$ limit for $I[\overrightarrow{S}^L; \overleftarrow{S}]$ and $I[\overrightarrow{S}^L; \mathcal{S}]$. There are
many elementary cases (e.g., the fair coin process) where
the latter limits exist while the former do not.

Remark 2. At first glance, it is tempting to see $\mathbf{E}$
as the amount of information stored in a process. As
Thm. 5 shows, this temptation should be resisted. $\mathbf{E}$ is
only a lower bound on the true amount of information
the process stores about its history, namely $C_\mu$. We can,
however, say that $\mathbf{E}$ measures the apparent information
in the process, since it is defined directly in terms of
observed sequences and not in terms of hidden, intrinsic
states, as $C_\mu$ is.

Remark 3. Perhaps another way to describe what $\mathbf{E}$
measures is to note that, by its implicit assumption of
block-Markovian structure, it takes sequence-blocks as
states. But even for the class of block-Markovian sources,
for which such an assumption is appropriate, excess
entropy and statistical complexity measure different kinds
of information storage. Refs. [65] and [80] showed that in
the case of one-dimensional range-$R$ spin systems, or any
other block-Markovian source where block configurations
are isomorphic to causal states,

$$C_\mu = \mathbf{E} + R h_\mu , \tag{64}$$

for finite $R$. Only for zero-entropy-rate block-Markovian
sources will the excess entropy, a quantity estimated
directly from sequence blocks, equal the statistical
complexity, the amount of memory stored in the process.
Examples of such sources include periodic processes, for
which we have $C_\mu = \mathbf{E} = \log_2 p$, where $p$ is the period.
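As a numerical illustration of Theorem 5 and Eq. (64), the following sketch evaluates the three quantities for a small range-1 Markov chain. The chain itself (a binary process in which the word 00 never occurs), its transition matrix, and all variable names are assumptions chosen for illustration; they do not come from the text. Because the two rows of the transition matrix differ, each value of the last symbol is its own causal state, which is the situation Eq. (64) addresses.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed example: a range-1 (R = 1) binary Markov chain in which the word
# "00" never occurs. transition[i][j] = P(next symbol = j | last symbol = i).
R = 1
transition = [[0.0, 1.0],
              [0.5, 0.5]]
stationary = [1 / 3, 2 / 3]  # satisfies stationary = stationary * transition

# The two rows differ, so each value of the last symbol is its own causal state.
c_mu = entropy(stationary)

# Entropy rate: average uncertainty of the next symbol given the last one.
h_mu = sum(pi * entropy(row) for pi, row in zip(stationary, transition))

# For a range-1 chain the past-future mutual information reduces to the
# mutual information between two adjacent symbols.
joint = [[stationary[i] * transition[i][j] for j in range(2)] for i in range(2)]
next_marginal = [joint[0][j] + joint[1][j] for j in range(2)]
excess = sum(
    joint[i][j] * math.log2(joint[i][j] / (stationary[i] * next_marginal[j]))
    for i in range(2) for j in range(2) if joint[i][j] > 0
)

print(f"C_mu = {c_mu:.4f}, E = {excess:.4f}, R*h_mu = {R * h_mu:.4f}")
```

For this chain the excess entropy is strictly smaller than the statistical complexity, and the gap is exactly $R h_\mu$, as Eq. (64) requires.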



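Corollary 2 For all prescient rivals $\widehat{\mathcal{R}}$,

$$\mathbf{E} \leq H[\widehat{\mathcal{R}}] . \tag{65}$$

Proof. This follows directly from Thm. 5, since, by the
minimality of causal states, $H[\widehat{\mathcal{R}}] \geq C_\mu$. QED.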

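Lemma 8 (Conditioning Does Not Affect Entropy
Rate) For all prescient rivals $\widehat{\mathcal{R}}$,

$$h[\overrightarrow{S}] = h[\overrightarrow{S} | \widehat{\mathcal{R}}] , \tag{66}$$

where the entropy rate $h[\overrightarrow{S}]$ and the conditional entropy
rate $h[\overrightarrow{S} | \widehat{\mathcal{R}}]$ were defined earlier in the development.

Proof. From Thm. 5 and its corollary we have

$$\lim_{L \to \infty} \left( H[\overrightarrow{S}^L] - H[\overrightarrow{S}^L | \widehat{\mathcal{R}}] \right) \leq \lim_{L \to \infty} H[\widehat{\mathcal{R}}] , \tag{67}$$

and, dividing both sides by $L$,

$$\lim_{L \to \infty} \frac{H[\overrightarrow{S}^L] - H[\overrightarrow{S}^L | \widehat{\mathcal{R}}]}{L} \leq \lim_{L \to \infty} \frac{H[\widehat{\mathcal{R}}]}{L} . \tag{68}$$

The right-hand side of Eq. (68) vanishes whenever $H[\widehat{\mathcal{R}}]$
is finite. Since, by Eq. (A4), $H[\overrightarrow{S}^L] - H[\overrightarrow{S}^L | \widehat{\mathcal{R}}] \geq 0$, we have

$$h[\overrightarrow{S}] - h[\overrightarrow{S} | \widehat{\mathcal{R}}] = 0 . \tag{69}$$

QED.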

Remark. Forcing the process into a certain state $\widehat{\mathcal{R}} = \rho$
is akin to applying a controller, once. But in the
infinite-entropy case, $H[\overrightarrow{S}^L] \to \infty$ as $L \to \infty$, with which we are
concerned, the future could contain (or consist of) an
infinite sequence of disturbances. In the face of this "grand
disturbance", the effects of the finite control are simply
washed out.

Another way of viewing this is to reflect on the fact
that $h[\overrightarrow{S}]$ accounts for the effects of all the dependencies
between all the parts of the entire semi-infinite future.
This, owing to the time-translation invariance of
stationarity, is equivalent to taking account of all the
dependencies in the entire process, including those between
past and future. But these are what is captured by
$h[\overrightarrow{S} | \widehat{\mathcal{R}}]$. It is not that conditioning on $\widehat{\mathcal{R}}$ fails to reduce
our uncertainty about the future; it does so, for all finite
times, and conditioning on $\mathcal{S}$ achieves the maximum possible
reduction in uncertainty. Rather, the lemma asserts that
such conditioning cannot affect the asymptotic rate at
which such uncertainty grows with time.
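The content of Lemma 8 can be checked by brute force on a small example. The sketch below enumerates all words of length $L$ for an assumed two-state machine (states "A" and "B", transition probabilities, and all function names are illustrative choices, not taken from the text) and compares the block entropy $H[\overrightarrow{S}^L]$ with the conditional block entropy given the state. The gap between the two stays bounded by $C_\mu$, so the per-symbol rates converge to the same value.

```python
import itertools
import math

# Assumed example: a two-state machine in which "00" never occurs.
# machine[state][symbol] = (emission probability, next state).
machine = {"A": {"1": (0.5, "A"), "0": (0.5, "B")},
           "B": {"1": (1.0, "A")}}
state_prob = {"A": 2 / 3, "B": 1 / 3}  # stationary distribution over states

def entropy(probs):
    """Shannon entropy, in bits, of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def word_prob(start, word):
    """Probability of emitting `word` when starting in state `start`."""
    prob, state = 1.0, start
    for symbol in word:
        if symbol not in machine[state]:
            return 0.0
        step, state = machine[state][symbol]
        prob *= step
    return prob

def block_entropies(length):
    """Return (H[future^L], H[future^L | state]) by enumerating all words."""
    words = ["".join(w) for w in itertools.product("01", repeat=length)]
    unconditional = entropy(
        sum(state_prob[s] * word_prob(s, w) for s in machine) for w in words)
    conditional = sum(
        state_prob[s] * entropy(word_prob(s, w) for w in words) for s in machine)
    return unconditional, conditional

c_mu = entropy(state_prob.values())
gaps = {}
for length in (1, 2, 4, 8, 12):
    h_block, h_cond = block_entropies(length)
    gaps[length] = h_block - h_cond  # reduction in uncertainty from the state
    print(length, round(h_block / length, 4), round(h_cond / length, 4))
```

The printed columns are the unconditional and conditional per-symbol entropies; their difference is `gaps[length] / length`, which shrinks as the length grows even though conditioning helps at every finite length.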

Theorem 6 (Control Theorem) Given a class $\widehat{\mathcal{R}}$ of
prescient rivals,

$$H[S] - h[\overrightarrow{S} | \widehat{\mathcal{R}}] \leq C_\mu , \tag{70}$$

where $H[S]$ is the entropy of a single symbol from the
observable stochastic process.

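Proof. As is well known (Ref. [62, Thm. 4.2.1, p. 64]),
for any stationary stochastic process,

$$\lim_{L \to \infty} \frac{H[\overrightarrow{S}^L]}{L} = \lim_{L \to \infty} H[S_L | \overrightarrow{S}^{L-1}] . \tag{71}$$

Moreover, the limits always exist. Up to this point, we
have defined $h[\overrightarrow{S}]$ in the manner of the left-hand side.
It will be convenient in the following to use that of the
right-hand side.

From the definition of conditional entropy, we have

$$H[\overleftarrow{S}^L] = H[\overleftarrow{S}^1 | \overleftarrow{S}^{L-1}] + H[\overleftarrow{S}^{L-1}]$$
$$= H[\overleftarrow{S}^{L-1} | \overleftarrow{S}^1] + H[\overleftarrow{S}^1] . \tag{72}$$

So we can express the entropy of the last observable the
process generated before the present as

$$H[\overleftarrow{S}^1] = H[\overleftarrow{S}^L] - H[\overleftarrow{S}^{L-1} | \overleftarrow{S}^1] \tag{73}$$
$$= H[\overleftarrow{S}^1 | \overleftarrow{S}^{L-1}] + H[\overleftarrow{S}^{L-1}] - H[\overleftarrow{S}^{L-1} | \overleftarrow{S}^1] \tag{74}$$
$$= H[\overleftarrow{S}^1 | \overleftarrow{S}^{L-1}] + I[\overleftarrow{S}^{L-1}; \overleftarrow{S}^1] . \tag{75}$$

We go from Eq. (73) to Eq. (74) by substituting the first
RHS of Eq. (72) for $H[\overleftarrow{S}^L]$.

Taking the $L \to \infty$ limit has no effect on the LHS,

$$H[\overleftarrow{S}^1] = \lim_{L \to \infty} \left( H[\overleftarrow{S}^1 | \overleftarrow{S}^{L-1}] + I[\overleftarrow{S}^{L-1}; \overleftarrow{S}^1] \right) . \tag{76}$$

Since the process is stationary, we can move the first
term in the limit forward to $H[S_L | \overrightarrow{S}^{L-1}]$. This limit is
$h[\overrightarrow{S}]$, by Eq. (71). Furthermore, because of stationarity,
$H[\overleftarrow{S}^1] = H[\overrightarrow{S}^1] = H[S]$. Shifting the entropy rate $h[\overrightarrow{S}]$
to the LHS of Eq. (76) and appealing to time-translation
once again, we have

$$H[S] - h[\overrightarrow{S}] = \lim_{L \to \infty} I[\overleftarrow{S}^{L-1}; \overleftarrow{S}^1] \tag{77}$$
$$= I[\overleftarrow{S}; \overrightarrow{S}^1] \tag{78}$$
$$= H[\overrightarrow{S}^1] - H[\overrightarrow{S}^1 | \overleftarrow{S}] \tag{79}$$
$$= H[\overrightarrow{S}^1] - H[\overrightarrow{S}^1 | \mathcal{S}] \tag{80}$$
$$= I[\overrightarrow{S}^1; \mathcal{S}] \tag{81}$$
$$\leq H[\mathcal{S}] = C_\mu , \tag{82}$$

where the last inequality comes from Eq. (A9). Since
$h[\overrightarrow{S}] = h[\overrightarrow{S} | \widehat{\mathcal{R}}]$ by Lemma 8, this is the statement of the
theorem. QED.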

Remark 1. The Control Theorem is inspired by, and is
a version of, Ashby's law of requisite variety [81, ch. 11].
This states that applying a controller can reduce the
uncertainty in the controlled variable by at most the
entropy of the control variable. (This result has recently
been rediscovered.) Thinking of the controlling variable
as the causal state, we have here a limitation
on the controller's ability to reduce the entropy rate.

Remark 2. This is the only result so far where the
difference between the finite-$L$ and the infinite-$L$ cases
is important. For the analogous result in the finite case,
see App. F, Thm. 7.

Remark 3. By applying the minimality theorem (Thm. 2)
and Lemma 8 we could go from the theorem as it stands to
$H[S] - h[\overrightarrow{S} | \widehat{\mathcal{R}}] \leq H[\widehat{\mathcal{R}}]$. This has a pleasing appearance of
symmetry to it, but is actually a weaker limit on the
strength of the pattern or, equivalently, on the amount of
control that fixing the causal state (or one of its rivals)
can exert.



VII. CONCLUDING REMARKS 

A. Discussion 

Let's review, informally, what we have shown. We 
began with questions about the nature of patterns and 
about pattern discovery. Our examination of these issues 
led us to want a way of describing patterns that was at
once algebraic, computational, intrinsically probabilistic, 
and causal. We then defined patterns in ensembles, in a 
very general and abstract sense, as equivalence classes of 
histories, or sets of hidden states, used for prediction. We 
defined the strength of such patterns (by their forecasting 
ability or prescience) and their statistical complexity (by 
the entropy of the states or the amount of information re- 
tained by the process about its history) . We showed that 
there was a limit on how strong such patterns could get 
for each particular process, given by the predictive ability 
of the entire past. In this way, we narrowed our goal to 
finding a predictor of maximum strength and minimum 
complexity. 

Optimal prediction led us to the equivalence relation
$\sim_\epsilon$ and the function $\epsilon$ and so to representing patterns by
causal states and their transitions: the ε-machine. Our
first theorem showed that the causal states are maximally
prescient; our second, that they are the simplest way of
representing the pattern of maximum strength; our third
theorem, that they are unique in having this double
optimality. Further results showed that ε-machines are the
least stochastic way of capturing maximum-strength
patterns and emphasized the need to employ the efficacious
but hidden states of the process, rather than just its gross
observables, such as sequence blocks.

Why are ε-machine states causal? First, ε-machine
architecture (say, as given by its semi-group algebra)
delineates the dependency between the morphs $\mathrm{P}(\overrightarrow{S} | \mathcal{S})$,
considered as events in which each new symbol determines
the succeeding morph. Thus, if state B follows
state A then A is a cause of B and B is an effect of A.
Second, ε-machine minimality guarantees that there are
no other events that intervene to render A and B
independent.

The ε-machine is thus a causal representation of all the
patterns in the process. It is maximally predictive and
minimally complex. It is at once computational, since it
shows how the process stores information (in the causal
states) and transforms that information (in the
state-to-state transitions), and algebraic (for details on
which see App. D). It can be analytically calculated from
given distributions and systematically approached from
empirical data. It satisfies the basic constraints laid out in
Sec. II F.



These comments suggest that computational mechan- 
ics and e-machines are related or may be of interest to 
a number of fields. Time series analysis, decision theory, 
machine learning, and universal coding theory explicitly 
or implicitly require models of observed processes. The 
theories of stochastic processes, formal languages and 
computation, and of measures of physical complexity are 
all concerned with representations of processes — concerns 
which also arise in the design of novel forms of comput- 
ing devices. Very often the motivations of these fields 
are far removed from computational mechanics. But it 
is useful, if only by way of contrast, to touch briefly on 
these areas and highlight one or several connections with 
computational mechanics, and we do so in App. G.



B. Limitations of the Current Results 

Let's catalogue the restrictive assumptions we made at 
the beginning and that were used by our development. 

1. We know exact joint probabilities over sequence 
blocks of all lengths for a process. 

2. The observed process takes on discrete values. 

3. The process is discrete in time. 

4. The process is a pure time series; e.g., without spa- 
tial extent. 



5. The observed process is stationary.

6. Prediction can only be based on the process's past,
not on any outside source of information.

The question arises, Can any be relaxed without much 
trouble? 

One way to lift the first limitation is to develop a
statistical error theory for ε-machine inference that
indicates, say, how much data is required to attain a given
level of confidence in an ε-machine with a given number of
causal states. This program is underway and, given its
initial progress, we describe several issues in more detail
in the next section.

The second limitation probably can be addressed, but 
with a corresponding increase in mathematical sophis- 
tication. The information-theoretic quantities we have 
used are also defined for continuous random variables. It 
is likely that many of the results carry over to the con- 
tinuous setting. 

The third limitation also looks similarly solvable, since 
continuous-time stochastic process theory is moderately 
well developed. This may involve sophisticated probabil- 
ity theory or functional analysis. 

As for the fourth limitation, there already exist tricks 
to make spatially extended systems look like time series. 
Essentially, one looks at all the paths through space- 
time, treating each one as if it were a time series. While 
this works well for data compression [83], it is not yet
clear whether it will be entirely satisfactory for
capturing structure. More work needs to be done on this
subject.

It is unclear at this time how to relax the assumption of 
stationarity. One can formally extend most of the results 
in this paper to non-stationary processes without much 
trouble. It is, however, unclear how much substantive 
content these extensions have and, in any case, a system- 
atic classification of non-stationary processes is (at best) 
in its infant stages. 

Finally, one might say that the last restriction is a pos- 
itive feature when it comes to thinking about patterns 
and the intrinsic structure of a process. "Pattern" is a 
vague word, of course, but even in ordinary usage it is 
only supposed to involve things inside the process, not 
the rest of the universe. Given two copies of a document, 
the contents of one copy can be predicted with an en- 
viable degree of accuracy by looking at the other copy. 
This tells us that they share a common structure, but 
says absolutely nothing about what that pattern is, since 
it is just as true of well-written and tightly-argued sci- 
entific papers (which presumably are highly organized) 
as it is of monkey-at-keyboard pieces of gibberish (which 
definitely are not). 

C. Conclusions and Directions for Future Work 

Computational mechanics aims to understand the
nature of patterns and pattern discovery. We hope that
the foregoing development has convinced the reader that
we are being neither rash when we say that we have laid
a foundation for those projects, nor flippant when we say
that patterns are what ε-machines represent and that we
discover them by ε-machine reconstruction. We would like
to close by marking out two broad avenues for future work.

First, consider the mathematics of e-machines them- 
selves. We have just mentioned possible extensions 
in the form of lifting assumptions made in this de- 
velopment, but there are many other ways to go. A 
number of measure-theoretic issues relating to the def- 
inition of causal states (omitted here for brevity) de- 
serve careful treatment, along the lines of Ref. jlQ]. It 
would be helpful to have a good understanding of the 
measurement-resolution scaling properties of e-machines 
for continuous-state processes, and of their relation to 
such ideas in automata theory as the Krohn-Rhodes de- 
composition |pofl. Anyone who manages to absorb Vol- 
ume II of Rcf. pq] would probably be in a position to 
answer interesting questions about the structures that 
processes preserve, perhaps even to give a purely relation- 
theoretic account of e-machines. We have alluded in a 
number of places to the trade-off between prescience and 
complexity. For a given process there is presumably a 
sequence of optimal machines connecting the one-state, 
zero-complexity machine with minimal prescience to the 
e-machine. Each member of the path is the minimal ma- 
chine for a certain degree of prescience; it would be very 
interesting to know what, if anything, we can say in gen- 
eral about the shape of this "prediction frontier" . 

Second, there is ε-machine reconstruction, an activity
about which we have said next to nothing. As we
mentioned above, there are already several algorithms
for reconstructing machines from data, even "on-line"
ones. It is fairly evident that these algorithms will find
the true machine in the limit of infinite time and infinite
data. What is needed is an understanding of the error
statistics [85] of different reconstruction procedures: of the
kinds of mistakes these procedures make and the
probabilities with which they make them. Ideally, we want to
tion. The aim is to calculate (i) the probabilities of differ- 
ent degrees of reconstruction error for a given volume of 
data, (ii) the amount of data needed to be confident of a 
fixed bound on the error, or (iii) the rates at which differ- 
ent reconstruction procedures converge on the e-machine. 
So far, an analytical theory has been developed that pre- 
dicts the average number of estimated causal states as a 
function of the amount of data used when reconstructing 
certain kinds of processes [86]. Once we possess a more
complete theory of statistical inference for e-machines, 
analogous perhaps to what already exists in computa- 
tional learning theory, we will be in a position to begin 
analyzing, sensibly and rigorously, the multitude of in- 
triguing patterns and information-processing structures 
the natural world presents.

APPENDIX A: INFORMATION-THEORETIC
FORMULAE

The following formulae prove useful in the development.
They are relatively intuitive, given our interpretation,
and they can all be proved with little more than straight
algebra; see Ref. [62, ch. 2]. Below, $f$ is a function.

$$H[X, Y] = H[X] + H[Y | X] \tag{A1}$$
$$H[X, Y] \geq H[X] \tag{A2}$$
$$H[X, Y] \leq H[X] + H[Y] \tag{A3}$$
$$H[X | Y] \leq H[X] \tag{A4}$$
$$H[X | Y] = H[X] \text{ iff } X \text{ is independent of } Y \tag{A5}$$
$$H[X, Y | Z] = H[X | Z] + H[Y | X, Z] \tag{A6}$$
$$H[X, Y | Z] \geq H[X | Z] \tag{A7}$$
$$H[X] - H[X | Y] = H[Y] - H[Y | X] \tag{A8}$$
$$I[X; Y] \leq H[X] \tag{A9}$$
$$I[X; Y] = H[X] \text{ iff } H[X | Y] = 0 \tag{A10}$$
$$H[f(X)] \leq H[X] \tag{A11}$$
$$H[X | Y] = 0 \text{ iff } X = f(Y) \tag{A12}$$
$$H[f(X) | Y] \leq H[X | Y] \tag{A13}$$
$$H[X | f(Y)] \geq H[X | Y] \tag{A14}$$

Eqs. (A1) and (A6) are called the chain rules for
entropies. Strictly speaking, the right hand side of
Eq. (A12) should read "for each $y$, $\mathrm{P}(X = x | Y = y) > 0$
for one and only one $x$".
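Several of these identities are easy to confirm numerically. The following sketch checks Eqs. (A1), (A3), (A4), (A8), (A9), and (A11) on a small joint distribution. The distribution, the merging function standing in for $f$, and all helper names are arbitrary choices made for illustration; conditional entropy is computed directly from conditional distributions rather than through the chain rule, so the check of (A1) is not circular.

```python
import math
from collections import defaultdict

# Arbitrary joint distribution P(X, Y), chosen only for illustration.
joint = {("a", 0): 0.30, ("a", 1): 0.10,
         ("b", 0): 0.15, ("b", 1): 0.25,
         ("c", 0): 0.20}

def entropy(dist):
    """Shannon entropy, in bits, of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint_dist, axis):
    """Marginal distribution of coordinate `axis` (0 for X, 1 for Y)."""
    out = defaultdict(float)
    for pair, p in joint_dist.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_entropy(joint_dist, target, given):
    """H[target | given], computed as sum_g P(g) * H[target | given = g]."""
    total = 0.0
    for g, p_g in marginal(joint_dist, given).items():
        cond = {pair[target]: p / p_g
                for pair, p in joint_dist.items() if pair[given] == g}
        total += p_g * entropy(cond)
    return total

h_x = entropy(marginal(joint, 0))
h_y = entropy(marginal(joint, 1))
h_xy = entropy(joint)
h_x_given_y = conditional_entropy(joint, 0, 1)
h_y_given_x = conditional_entropy(joint, 1, 0)
mutual_info = h_x - h_x_given_y

# f merges the outcomes "a" and "b", so f(X) is a coarser variable than X.
f_of_x = defaultdict(float)
for x, p in marginal(joint, 0).items():
    f_of_x["ab" if x in ("a", "b") else x] += p

tol = 1e-12
checks = {
    "A1": abs(h_xy - (h_x + h_y_given_x)) < tol,
    "A3": h_xy <= h_x + h_y + tol,
    "A4": h_x_given_y <= h_x + tol,
    "A8": abs((h_x - h_x_given_y) - (h_y - h_y_given_x)) < tol,
    "A9": mutual_info <= h_x + tol,
    "A11": entropy(f_of_x) <= h_x + tol,
}
print(checks)
```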



APPENDIX B: THE EQUIVALENCE RELATION 
THAT INDUCES CAUSAL STATES 

Any relation that is reflexive, symmetric, and transi- 
tive is an equivalence relation. 

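Consider the set $\overleftarrow{\mathbf{S}}$ of all past sequences, of any length:

$$\overleftarrow{\mathbf{S}} = \{ \overleftarrow{s}^L = s_{-L} \cdots s_{-1} : s_i \in \mathcal{A}, L \in \mathbb{Z}^+ \} . \tag{B1}$$

Recall that $\overleftarrow{s}^0 = \lambda$, the empty string. We define the
relation $\sim_\epsilon$ over $\overleftarrow{\mathbf{S}}$ by

$$\overleftarrow{s}_i^K \sim_\epsilon \overleftarrow{s}_j^L \iff \mathrm{P}(\overrightarrow{S} | \overleftarrow{s}_i^K) = \mathrm{P}(\overrightarrow{S} | \overleftarrow{s}_j^L) , \tag{B2}$$

for all semi-infinite $\overrightarrow{S} = s_0 s_1 s_2 \cdots$, where $K, L \in \mathbb{Z}^+$.
Here we show that $\sim_\epsilon$ is an equivalence relation by
reviewing the basic properties of relations, equivalence
classes, and partitions. (The proof details are
straightforward and are not included. See Ref. [87].) We
will drop the length variables $K$ and $L$ and denote by
$\overleftarrow{s}, \overleftarrow{s}', \overleftarrow{s}'' \in \overleftarrow{\mathbf{S}}$ members of any length in the set $\overleftarrow{\mathbf{S}}$ of
Eq. (B1).

First, $\sim_\epsilon$ is a relation on $\overleftarrow{\mathbf{S}}$ since we can represent it
as a subset of the Cartesian product

$$\overleftarrow{\mathbf{S}} \times \overleftarrow{\mathbf{S}} = \{ (\overleftarrow{s}, \overleftarrow{s}') : \overleftarrow{s}, \overleftarrow{s}' \in \overleftarrow{\mathbf{S}} \} . \tag{B3}$$

Second, the relation $\sim_\epsilon$ is an equivalence relation on $\overleftarrow{\mathbf{S}}$
since it is

1. reflexive: $\overleftarrow{s} \sim_\epsilon \overleftarrow{s}$, for all $\overleftarrow{s} \in \overleftarrow{\mathbf{S}}$;

2. symmetric: $\overleftarrow{s} \sim_\epsilon \overleftarrow{s}' \Rightarrow \overleftarrow{s}' \sim_\epsilon \overleftarrow{s}$; and

3. transitive: $\overleftarrow{s} \sim_\epsilon \overleftarrow{s}'$ and $\overleftarrow{s}' \sim_\epsilon \overleftarrow{s}'' \Rightarrow \overleftarrow{s} \sim_\epsilon \overleftarrow{s}''$.

Third, if $\overleftarrow{s} \in \overleftarrow{\mathbf{S}}$, the equivalence class of $\overleftarrow{s}$ is

$$[\overleftarrow{s}] = \{ \overleftarrow{s}' \in \overleftarrow{\mathbf{S}} : \overleftarrow{s}' \sim_\epsilon \overleftarrow{s} \} . \tag{B4}$$

The set of all equivalence classes in $\overleftarrow{\mathbf{S}}$ is denoted $\overleftarrow{\mathbf{S}}/\sim_\epsilon$
and is called the factor set of $\overleftarrow{\mathbf{S}}$ with respect to $\sim_\epsilon$. In
Sec. IV A we called the individual equivalence classes
causal states $\mathcal{S}_i$ and denoted the set of causal states
$\mathcal{S} = \{ \mathcal{S}_i : i = 0, 1, \ldots, k - 1 \}$. That is, $\mathcal{S} = \overleftarrow{\mathbf{S}}/\sim_\epsilon$.
(We noted in the main development that the cardinality
$k = |\mathcal{S}|$ of causal states may or may not be finite.)

Finally, we list several basic properties of the
causal-state equivalence classes.

1. $\bigcup_{\overleftarrow{s} \in \overleftarrow{\mathbf{S}}} [\overleftarrow{s}] = \overleftarrow{\mathbf{S}}$.

2. $\bigcup_{i=0}^{k-1} \mathcal{S}_i = \overleftarrow{\mathbf{S}}$.

3. $[\overleftarrow{s}] = [\overleftarrow{s}'] \iff \overleftarrow{s} \sim_\epsilon \overleftarrow{s}'$.

4. If $\overleftarrow{s}, \overleftarrow{s}' \in \overleftarrow{\mathbf{S}}$, either

(a) $[\overleftarrow{s}] \cap [\overleftarrow{s}'] = \emptyset$ or

(b) $[\overleftarrow{s}] = [\overleftarrow{s}']$.

5. The causal states $\mathcal{S}$ are a partition of $\overleftarrow{\mathbf{S}}$. That is,

(a) $\mathcal{S}_i \neq \emptyset$ for each $i$,

(b) $\bigcup_{i=0}^{k-1} \mathcal{S}_i = \overleftarrow{\mathbf{S}}$, and

(c) $\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$ for all $i \neq j$.

We denote the start state with $\mathcal{S}_0$. The start state is
the causal state associated with $\overleftarrow{s} = \lambda$. That is, $\mathcal{S}_0 = [\lambda]$.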
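The construction of this appendix can be sketched in code for a process whose word probabilities are known exactly. The example below groups short histories into equivalence classes by comparing their conditional distributions over futures. Everything specific in it is an assumption made for illustration: the two-state generating machine, the truncation of futures to a fixed finite length (Eq. (B2) refers to semi-infinite futures), the rounding used to compare floating-point distributions, and the function names.

```python
import itertools
from collections import defaultdict

# Assumed generator: a two-state machine in which "00" never occurs.
# machine[state][symbol] = (emission probability, next state).
machine = {"A": {"1": (0.5, "A"), "0": (0.5, "B")},
           "B": {"1": (1.0, "A")}}
state_prob = {"A": 2 / 3, "B": 1 / 3}  # stationary distribution over states

def word_prob(word):
    """Stationary probability of observing `word`."""
    total = 0.0
    for start, weight in state_prob.items():
        prob, state = weight, start
        for symbol in word:
            if symbol not in machine[state]:
                prob = 0.0
                break
            step, state = machine[state][symbol]
            prob *= step
        total += prob
    return total

def morph(history, future_length):
    """Conditional distribution over futures of a fixed length, given `history`."""
    p_history = word_prob(history)
    return tuple(
        round(word_prob(history + "".join(f)) / p_history, 10)
        for f in itertools.product("01", repeat=future_length))

def equivalence_classes(max_history, future_length):
    """Group all positive-probability histories that share the same morph."""
    classes = defaultdict(list)
    for k in range(max_history + 1):
        for h in itertools.product("01", repeat=k):
            history = "".join(h)
            if word_prob(history) > 0:
                classes[morph(history, future_length)].append(history)
    return list(classes.values())

classes = equivalence_classes(max_history=4, future_length=3)
for members in classes:
    print(members)
```

In this example the empty history forms a class of its own, corresponding to the start state $[\lambda]$, while every other history is classified by its final symbol.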



APPENDIX C: TIME REVERSAL 

The definitions and properties of the causal states
obtained by scanning sequences in the opposite direction,
i.e., the causal states $\overrightarrow{\mathbf{S}}/\sim_\epsilon$, follow similarly to those
derived just above in App. B. In general,
$\overleftarrow{\mathbf{S}}/\sim_\epsilon \neq \overrightarrow{\mathbf{S}}/\sim_\epsilon$.
That is, past causal states are not necessarily the same
as future causal states; past and future morphs can
differ; unlike the entropy rate, past and future statistical
complexities need not be equal: $\overleftarrow{C}_\mu \neq \overrightarrow{C}_\mu$; and so on. The
presence or lack of this type of time-reversal symmetry, as
reflected in these inequalities, is a fundamental property
of a process.



APPENDIX D: e-MACHINES ARE MONOIDS 

A semi-group is a set of elements closed under an
associative binary operator, but without a guarantee that
every, or indeed any, element has an inverse [88]. A
monoid is a semi-group with an identity element. Thus,
semi-groups and monoids are generalizations of groups.
Just as the algebraic structure of a group is generally
interpreted as a symmetry, we propose to interpret the
algebraic structure of a semi-group as a generalized
symmetry. The distinction between monoids and other
semi-groups becomes important here: only semi-groups with
an identity element (i.e., monoids) can contain subsets
that are groups and so represent conventional symmetries.

We claim that the transformations that concatenate
strings of symbols from $\mathcal{A}$ onto other such strings form a
semi-group $G$, the generators of which are the
transformations that concatenate the elements of $\mathcal{A}$. The
identity element is to be provided by concatenating the null
symbol $\lambda$. The concatenation of string $t$ onto the string $s$ is
forbidden if and only if strings of the form $st$ have
probability zero in a process. All such concatenations are to be
realized by a single semi-group element denoted $\emptyset$. Since
if $\mathrm{P}(st) = 0$, then $\mathrm{P}(stu) = \mathrm{P}(ust) = 0$ for any string
$u$, we require that $\emptyset g = g \emptyset = \emptyset$ for all $g \in G$. Can we
provide a representation of this semi-group?

Recall that, from our definition of the labeled
transition probabilities, $T_{ij}^{(\lambda)} = \delta_{ij}$. Thus, $T^{(\lambda)}$ is an
identity element. This suggests using the labeled
transition matrices to form a matrix representation of the
semi-group. Accordingly, first define $U_{ij}^{(s)}$ by setting
$U_{ij}^{(s)} = 0$ when $T_{ij}^{(s)} = 0$ and $U_{ij}^{(s)} = 1$ otherwise, to
remove probabilities. Then define the set of matrices
$U = \{ T^{(\lambda)} \} \cup \{ U^{(s)}, s \in \mathcal{A} \}$. Finally, define $G$ as the
set of all matrices generated from the set $U$ by recursive
multiplication. That is, an element $g$ of $G$ is

$$g^{(ab \ldots cd)} = U^{(d)} U^{(c)} \cdots U^{(b)} U^{(a)} , \tag{D1}$$

where $a, b, \ldots, c, d \in \mathcal{A}$. Clearly, $G$ constitutes a
semi-group under matrix multiplication. Moreover, $g^{(a \ldots bc)} = 0$
(the all-zero matrix) if and only if, having emitted the
symbols $a \ldots b$ in order, we must arrive in a state from
which it is impossible to emit the symbol $c$. That is, the
zero-matrix 0 is generated if and only if the concatenation
of $c$ onto $a \ldots b$ is forbidden. The element $\emptyset$ is thus the
all-zero matrix 0, which clearly satisfies the necessary
constraints. This completes the proof of the proposition.

We call the matrix representation of $G$ (Eq. (D1) taken
over all words in $\mathcal{A}^k$) the semi-group machine of
the ε-machine $\{ \mathcal{S}, T \}$. See Ref. [88].
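A minimal sketch of this matrix representation, for an assumed two-state machine in which the word 00 is forbidden, is given below. The machine, the matrix names, and in particular the index convention are assumptions: entry `[j][i]` is set to 1 when state `i` can emit the symbol and move to state `j`, which is the convention under which the right-to-left ordering of Eq. (D1) corresponds to emitting the symbols in order.

```python
import itertools

# Assumed machine with states A (index 0) and B (index 1):
#   A emits "1" and stays in A, or emits "0" and moves to B;
#   B can only emit "1", returning to A.
# Column convention (an assumption): U[s][j][i] = 1 when i --s--> j is allowed.
U = {"0": [[0, 0],
           [1, 0]],
     "1": [[1, 1],
           [0, 0]]}
IDENTITY = [[1, 0], [0, 1]]  # plays the role of T^(lambda)
ZERO = [[0, 0], [0, 0]]      # the absorbing element for forbidden words

def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def element(word):
    """Semi-group element for `word`, built as in Eq. (D1): U(d)...U(b)U(a)."""
    g = IDENTITY
    for symbol in word:
        g = matmul(U[symbol], g)  # each new symbol multiplies on the left
    return g

# Collect every word of length 1 to 4 whose element is the zero matrix.
forbidden = [
    "".join(w)
    for n in range(1, 5)
    for w in itertools.product("01", repeat=n)
    if element("".join(w)) == ZERO
]
print(forbidden)
```

For this machine the words mapped to the zero matrix are exactly those containing 00, and multiplying the zero matrix by any generator on either side leaves it unchanged.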



APPENDIX E: ALTERNATE PROOF OF THE 
REFINEMENT LEMMA 

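The proof of the Refinement Lemma carries through
verbally, but we do not wish to leave loop-holes.
Unfortunately, this means introducing two new bits of
mathematics.

First of all, we need the largest classes that are strictly
homogeneous with respect to $\overrightarrow{S}^L$ for fixed $L$;
these are, so to speak, truncations of the causal states.
Accordingly, we will talk about $\mathcal{S}^L$ and $\sigma^L$, which are
analogous to $\mathcal{S}$ and $\sigma$. We will also need to define the
function $\phi_{\sigma^L \rho} \equiv \mathrm{P}(\mathcal{S}^L = \sigma^L | \widehat{\mathcal{R}} = \rho)$.

Putting these together, for every $L$ we have

$$H[\overrightarrow{S}^L | \widehat{\mathcal{R}} = \rho] = H\Big[ \sum_{\sigma^L} \phi_{\sigma^L \rho} \, \mathrm{P}(\overrightarrow{S}^L | \mathcal{S}^L = \sigma^L) \Big] \tag{E1}$$
$$\geq \sum_{\sigma^L} \phi_{\sigma^L \rho} \, H[\overrightarrow{S}^L | \mathcal{S}^L = \sigma^L] . \tag{E2}$$

Thus,

$$H[\overrightarrow{S}^L | \widehat{\mathcal{R}}] = \sum_{\rho} \mathrm{P}(\widehat{\mathcal{R}} = \rho) \, H[\overrightarrow{S}^L | \widehat{\mathcal{R}} = \rho] \tag{E3}$$
$$\geq \sum_{\rho} \mathrm{P}(\widehat{\mathcal{R}} = \rho) \sum_{\sigma^L} \phi_{\sigma^L \rho} \, H[\overrightarrow{S}^L | \mathcal{S}^L = \sigma^L] \tag{E4}$$
$$= \sum_{\sigma^L, \rho} \mathrm{P}(\widehat{\mathcal{R}} = \rho) \, \phi_{\sigma^L \rho} \, H[\overrightarrow{S}^L | \mathcal{S}^L = \sigma^L] \tag{E5}$$
$$= \sum_{\sigma^L, \rho} \mathrm{P}(\mathcal{S}^L = \sigma^L, \widehat{\mathcal{R}} = \rho) \, H[\overrightarrow{S}^L | \mathcal{S}^L = \sigma^L] \tag{E6}$$
$$= \sum_{\sigma^L} \mathrm{P}(\mathcal{S}^L = \sigma^L) \, H[\overrightarrow{S}^L | \mathcal{S}^L = \sigma^L] \tag{E7}$$
$$= H[\overrightarrow{S}^L | \mathcal{S}^L] . \tag{E8}$$

That is to say,

$$H[\overrightarrow{S}^L | \widehat{\mathcal{R}}] \geq H[\overrightarrow{S}^L | \mathcal{S}^L] \tag{E9}$$

with equality if and only if every $\phi_{\sigma^L \rho}$ is either 0 or 1.
Thus, if $H[\overrightarrow{S}^L | \widehat{\mathcal{R}}] = H[\overrightarrow{S}^L | \mathcal{S}^L]$, every $\rho$ is entirely
contained within some $\sigma^L$, except for possible subsets of
measure 0. But if this is true for every $L$ (which, in
the case of a prescient rival $\widehat{\mathcal{R}}$, it is) then every $\rho$ is
at least weakly homogeneous with respect to
all $\overrightarrow{S}^L$. Thus, by the relevant lemma of the main
development, all its members, except for
that same subset of measure 0, belong to the same causal
state. QED.

APPENDIX F: FINITE ENTROPY FOR THE
SEMI-INFINITE FUTURE

While cases where $H[\overrightarrow{S}]$ is finite (more exactly, where
$\lim_{L \to \infty} H[\overrightarrow{S}^L]$ exists and is finite) may be uninteresting
for information-theorists, they are of great interest to
physicists, since they correspond, among other things, to
periodic and limit-cycle behaviors. There are, however,
only two substantial differences between what is true
of the infinite-entropy processes considered in the main
body of the development and the finite-entropy case.

First, we can simply replace statements of the form
"for all $L$, $H[\overrightarrow{S}^L] \ldots$" with $H[\overrightarrow{S}]$. For example, the
optimal prediction theorem (Thm. 1) for finite-entropy
processes becomes: for all $\widehat{\mathcal{R}}$, $H[\overrightarrow{S} | \widehat{\mathcal{R}}] \geq H[\overrightarrow{S} | \mathcal{S}]$. The
details of the proofs are, however, entirely analogous.

Second, we can prove a substantially stronger version
of the control theorem (Thm. 6).

Theorem 7 (The Finite-Control Theorem) For all
prescient rivals $\widehat{\mathcal{R}}$,

$$H[\overrightarrow{S}] - H[\overrightarrow{S} | \widehat{\mathcal{R}}] \leq C_\mu . \tag{F1}$$

Proof. By a direct application of Eq. (A9) and the
definition of mutual information, we have that

$$H[\overrightarrow{S}] - H[\overrightarrow{S} | \mathcal{S}] \leq H[\mathcal{S}] . \tag{F2}$$

But, by the definition of prescient rivals (Def. 11),
$H[\overrightarrow{S} | \mathcal{S}] = H[\overrightarrow{S} | \widehat{\mathcal{R}}]$ and, by definition, $C_\mu = H[\mathcal{S}]$.
Substituting equals for equals gives us the theorem. QED.

APPENDIX G: RELATIONS TO OTHER FIELDS

1. Time Series Modeling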

The goal of time series modeling is to predict the
future of a measurement series on the basis of its past.
Broadly speaking, this can be divided into two parts:
identify equivalent pasts and then produce a prediction
for each class of equivalent pasts. That is, we first pick
a function $\eta : \overleftarrow{\mathbf{S}} \mapsto \widehat{\mathcal{R}}$ and then pick another function
$p : \widehat{\mathcal{R}} \mapsto \overrightarrow{\mathbf{S}}$. Of course, we can choose for the range
of $p$ futures of some finite length (length 1 is popular)
or even choose distributions over these. While practical
applications often demand a single definite prediction
("You will meet a tall dark stranger"), there are obvious
advantages to predicting a distribution ("You have a .95
chance of meeting a tall dark stranger and a .05 chance of
meeting a tall familiar albino."). Clearly, the best choice
for $p$ is the actual conditional distribution of futures for
each $\rho \in \widehat{\mathcal{R}}$. Given this, the question becomes what the
best $\widehat{\mathcal{R}}$ is; i.e., what is the best $\eta$? At least in the case
of trying to understand the whole of the underlying
process, we have shown that the best $\eta$ is, unambiguously,
$\epsilon$. Thus, our discussion has implicitly subsumed that of
traditional time series modeling.
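The two-step scheme described above (first a partition of pasts, then a prediction per class) can be sketched on sampled data. The example is a simple empirical estimate under stated assumptions, not an ε-machine reconstruction algorithm: the test process, the choice of the last symbol as the partition function, the history length, and every function name are hypothetical.

```python
import random
from collections import Counter, defaultdict

def sample_series(n, seed=0):
    """Draw n symbols from an assumed test process in which "00" never occurs."""
    rng = random.Random(seed)
    series, last = [], "1"
    for _ in range(n):
        # After a "0" the next symbol must be "1"; otherwise flip a fair coin.
        last = "1" if last == "0" or rng.random() < 0.5 else "0"
        series.append(last)
    return "".join(series)

def eta(past):
    """Hypothetical partition of pasts: keep only the most recent symbol."""
    return past[-1]

def fit_p(series, history_length=3):
    """Estimate p: class of pasts -> distribution over the next symbol."""
    counts = defaultdict(Counter)
    for t in range(history_length, len(series)):
        counts[eta(series[t - history_length:t])][series[t]] += 1
    return {state: {sym: n / sum(c.values()) for sym, n in c.items()}
            for state, c in counts.items()}

series = sample_series(20000)
p = fit_p(series)
for state in sorted(p):
    print(state, p[state])
```

For this particular process the last symbol happens to carry all the predictive information, so the hypothetical `eta` coincides with the causal-state partition of non-empty pasts; for other processes a coarser or finer partition would be needed.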

Computational mechanics — in its focus on letting the 
process speak for itself through (possibly impoverished) 
measurements — follows the spirit that motivated one ap- 
proach to experimentally testing dynamical systems the- 
ory. Specifically, it follows in spirit the methods of re- 
constructing "geometry from a time series" introduced 
by Refs. [90] and [91]. A closer parallel is found,
however, in later work on estimating minimal equations of
motion from data series [92].



2. Decision-Theoretic Problems 

The classic focus of decision theory is "rules of
inductive behavior" [93–95]. The problem is to choose functions
from observed data to courses of action that possess
desirable properties. This task has obvious affinities to
considering the properties of $\epsilon$ and its rivals $\eta$. We can go
further and say that what we have done is consider a de- 
cision problem, in which the available actions consist of 
predictions about the future of the process. The calcu- 
lation of the optimum rule of behavior in general faces 
formidable technicalities, such as providing an estimate 
of the utility of every different course of action under 
every different hypothesis about the relevant aspects of 
the world. On the one hand, it is not hard to concoct 
time-series tasks where the optimal rule of behavior does 
not use e at all. On the other hand, if we simply aim to 
predict the process indefinitely far into the future, then 
because the causal states are minimal sufficient statistics 
for the distribution of futures (Thm. ||, Remark 4), the 
optimal rule of behavior will use e. 

3. Stochastic Processes

Clearly, the computational mechanics approach to patterns
and pattern discovery involves stochastic processes
in an intimate and inextricable way. Probabilists have,
of course, long been interested in using information-theoretic
tools to analyze stochastic processes, particularly
their ergodic behavior [?, 96-98]. There has also
been considerable work in the hidden Markov model and
optimal prediction literatures on inferring models of processes
from data or from given distributions [?, 99-102].
To the best of our knowledge, however, these two approaches
have not been previously combined.

Perhaps the closest approach to the spirit of computational
mechanics in the stochastic process literature
is, surprisingly, the now-classical theory of optimal prediction
and filtering for stationary processes, developed
by Wiener and Kolmogorov [103-106]. The two theories
share the use of information-theoretic notions, the unification
of prediction and structure, and the conviction
that "the statistical mechanics of time series" is a "field
in which conditions are very remote from those of the
statistical mechanics of heat engines and which is thus
very well suited to serve as a model of what happens in
the living organism" [106, p. 59]. So far as we have been
able to learn, however, no one has ever used this theory
to explicitly identify causal states and causal structure,
leaving these implicit in the mathematical form of the
prediction and filtering operators. Moreover, the Wiener-Kolmogorov
framework forces us to sharply separate the
linear and nonlinear aspects of prediction and filtering,
because it has a great deal of trouble calculating nonlinear
operators [105]. Computational mechanics is completely
indifferent to this issue, since it packs all of the
process's structure into the $\epsilon$-machine, which is equally
calculable in linear or strongly nonlinear situations.



4. Formal Language Theory and Grammatical 
Inference 

A formal language is a set of symbol strings ("words" 
or "allowed words") drawn from a finite alphabet. Ev- 
ery formal language may be described either by a set of 
rules (a "grammar") for creating all and only the allowed 
words, by an abstract automaton which also generates 
the allowed words, or by an automaton which accepts 
the allowed words and rejects all "forbidden" words. Our
$\epsilon$-machines, stripped of probabilities, correspond to such
automata: generative in the simple case, or classificatory
if we add a reject state and move to it when none of the
allowed symbols are encountered.
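
As a minimal sketch of the classificatory reading, the following Python fragment strips the probabilities from a small machine and adds an explicit reject state. The two-state automaton, its state names `A` and `B`, and the function `accepts` are hypothetical choices made for illustration; they are not taken from the text. The acceptor is started in state `A`, so it checks whether a word can follow a history that ended in that state.

```python
# Hypothetical two-state automaton (states "A" and "B" are illustrative):
# 1s may only occur in blocks of even length between 0s.
TRANSITIONS = {
    ("A", "0"): "A",
    ("A", "1"): "B",
    ("B", "1"): "A",
}
REJECT = "reject"


def accepts(word, start="A"):
    """Return True if `word` can be read from `start` without rejection."""
    state = start
    for symbol in word:
        # Any symbol with no allowed transition sends us to the reject state.
        state = TRANSITIONS.get((state, symbol), REJECT)
        if state == REJECT:
            return False
    return True


print(accepts("0110"))
print(accepts("010"))
```

The first word contains a block of two 1s and is accepted; the second contains an isolated 1 followed by 0 and is rejected.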

Since Chomsky [107, 108], it has been known that formal
languages can be classified into a hierarchy, the
higher levels of which have strictly greater expressive
power. The hierarchy is defined by restricting the form
of the grammatical rules or, equivalently, by limiting the
amount and kind of memory available to the automata.
The lowest level of the hierarchy is that of regular languages,
which may be familiar to Unix-using readers as
regular expressions. These correspond to finite-state machines
and to hidden Markov models of finite dimension.
In such cases, relatives of our minimality and uniqueness
theorems are well known [?], and the construction
of causal states is analogous to the "Nerode equivalence
classing" procedure [?, 109]. Our theorems, however, are
not restricted to this low-memory, non-stochastic setting.

The problem of learning a language from observational 
data has been extensively studied by linguists, and by 
computer scientists interested in natural-language processing.
Unfortunately, well developed learning techniques
exist only for the two lowest classes in the Chomsky
hierarchy, the regular and the context-free languages.
(For a good account of these procedures see Ref. [110].)
Adapting and extending this work to the reconstruction
of $\epsilon$-machines should form a useful area of future research,
a point to which we alluded in the concluding remarks.



5. Computational and Statistical Learning Theory 



The goal of computational learning theory [111, 112] is
to identify algorithms that quickly, reliably, and simply
lead to good representations of a target "concept". The
latter is typically defined to be a binary dichotomy of
a certain feature or input space. Particular attention is
paid to results about "probably approximately correct"
(PAC) procedures [113]: those having a high probability
of finding members of a fixed "representation class"
(e.g., neural nets, Boolean functions in disjunctive normal
form, and deterministic finite automata). The key
word here is "fixed"; as in contemporary time-series analysis,
practitioners of this discipline acknowledge the importance
of getting the representation class right. (Getting
it wrong can make easy problems intractable.) In
practice, however, they simply take the representation
class as a given, even assuming that we can always count
on it having at least one representation which exactly captures
the target concept. Although this is in line with implicit
assumptions in most of mathematical statistics, it
seems dubious when analyzing learning in the real world
[?, 114, 115].

In any case, the preceding development made no such 
assumption. One of the goals of computational mechan- 
ics is, exactly, discovering the best representation. This 
is not to say that the results of computational learning 
theory are not remarkably useful and elegant, nor that 
one should not take every possible advantage of them 
in implementing $\epsilon$-machine reconstruction. In our view,
though, these theories belong more to statistical infer- 
ence, particularly to algorithmic parameter estimation, 
than to foundational questions about the nature of pat- 
tern and the dynamics of learning. 

Finally, in a sense computational mechanics' focus on
causal states is a search for a particular kind of structural
decomposition for a process. That decomposition is most
directly reflected in the conditional independence of past
and future that causal states induce. This decomposition
reminds one of the important role that conditional
independence plays in contemporary methods for artificial
intelligence, both for developing systems that reason
in fluctuating environments [116] and the more recently
developed algorithmic methods of graphical models
[117, 118].



6. Description-Length Principles and Universal 
Coding Theory 

Rissanen's minimum description length (MDL) principle,
most fully described in Ref. [?], is a procedure
for selecting the most concise generative model out of a
family of models that are all statistically consistent with
given data. The MDL approach starts from Shannon's results
on the connection between probability distributions
and codes. Rissanen's development follows the inductive
framework introduced by Solomonoff [?].

Suppose we choose a representation that leads to a
class $\mathcal{M}$ of models and are given data set $X$. The MDL
principle enjoins us to pick the model $M \in \mathcal{M}$ that minimizes
the sum of the length of the description of $X$ given
$M$, plus the length of description of $M$ given $\mathcal{M}$. The
description length of $X$ is taken to be $-\log P(X|M)$;
cf. Eq. (?). The description length of $M$ may be regarded
as either given by some coding scheme or, equivalently, by
some distribution over the members of $\mathcal{M}$. (Despite the
similarities to model estimation in a Bayesian framework
[119], Rissanen does not interpret this distribution as a
Bayesian prior or regard description length as a measure
of evidential support.)
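
Written out, the selection rule just described reads as follows. The symbols $\hat{M}$ for the selected model and $L(M \mid \mathcal{M})$ for the description length of $M$ within the class are notational assumptions introduced here for illustration; when that length comes from a distribution $Q$ over $\mathcal{M}$ (also an assumed symbol), it is $-\log Q(M)$.

```latex
\hat{M} = \operatorname*{arg\,min}_{M \in \mathcal{M}}
  \left[ -\log P(X \mid M) + L(M \mid \mathcal{M}) \right],
\qquad
L(M \mid \mathcal{M}) = -\log Q(M).
```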

The construction of causal states is somewhat similar
to the states estimated in Rissanen's context algorithm
[?, 120] (and to the "vocabularies" built by universal
coding schemes, such as the popular Lempel-Ziv
algorithm [121, 122]). Despite the similarities, there are
significant differences. For a random source, for which
there is a single causal state, the context algorithm estimates
a number of states that diverges (at least logarithmically)
with the length of the data stream, rather
than inferring a single state, as $\epsilon$-machine reconstruction
would. Moreover, we avoid any reference to encodings of
rival models or to prior distributions over them; $C_\mu(\mathcal{R})$
is not a description length.



7. Measure Complexity 

Ref. [75] proposed that the appropriate measure of the
complexity of a process was the "minimal average Shannon
information needed" for optimal prediction. This
true measure complexity was to be taken as the Shannon
entropy of the states used by some optimal predictor.
The same paper suggested that it could be approximated
(from below) by the excess entropy, there called the effective
measure complexity, as noted in Sec. ? above.
This is a position closely allied to that of computational
mechanics, to Rissanen's MDL principle, and to the minimal
embeddings introduced by the "geometry of a time
series" methods [?] just described.

In contrast to computational mechanics, however, the 
key notion of "optimal prediction" was left undefined, 
as were the nature and construction of the states of the 
optimal predictor. In fact, the predictors used required 
knowing the process's underlying equations of motion. 
Moreover, the statistical complexity $C_\mu(\mathcal{S})$ differs from
the measure complexities in that it is based on the well
defined causal states, whose optimal predictive powers
are in turn precisely defined. Thus, computational mechanics
is an operational and constructive formalization
of the insights expressed in Ref. [75].



8. Hierarchical Scaling Complexity 



Introduced in Ref. [123, ch. 9], this approach seeks,
like computational mechanics, to extend certain traditional
ideas of statistical physics. In brief, the method is
to construct a hierarchy of $n$th-order Markov models and
examine the convergence of their predictions with the real
distribution of observables as $n \to \infty$. The discrepancy
between prediction and reality is, moreover, defined information
theoretically, in terms of the relative entropy
or Kullback-Leibler distance [32, 71]. (We have not used
this quantity.) The approach implements Weiss's discovery
that for finite-state sources there is a structural
distinction between block-Markovian sources (subshifts
of finite type) and sofic systems. Weiss showed that, despite
their finite memory, sofic systems are the limit of
an infinite series of increasingly large block-Markovian
sources [124].
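
A small numerical sketch may make this convergence concrete. It assumes a specific example that is not discussed in this section: a two-state sofic source in which 1s occur only in even-length blocks, with hypothetical state names `A` and `B` and fair branching in `A`. For a stationary source, the relative-entropy rate between the process and its $n$th-order Markov approximation equals $h_n - h_\mu$, where $h_n = H(n+1) - H(n)$ is a difference of block entropies; the code computes these gaps by brute-force enumeration of words. All function names are illustrative.

```python
from itertools import product
from math import log2

# Hypothetical sofic source: EMIT[state][symbol] = (probability, next state).
EMIT = {
    "A": {"0": (0.5, "A"), "1": (0.5, "B")},
    "B": {"1": (1.0, "A")},
}
STATIONARY = {"A": 2 / 3, "B": 1 / 3}  # stationary state probabilities
H_MU = STATIONARY["A"] * 1.0  # entropy rate in bits: only state A branches


def word_probability(word):
    """Exact probability of `word`, summing over the starting state."""
    total = 0.0
    for start, weight in STATIONARY.items():
        state, prob = start, weight
        for symbol in word:
            if symbol not in EMIT[state]:
                prob = 0.0  # forbidden transition: this path contributes 0
                break
            step, state = EMIT[state][symbol]
            prob *= step
        total += prob
    return total


def block_entropy(length):
    """Shannon entropy (bits) of the distribution over words of `length`."""
    entropy = 0.0
    for letters in product("01", repeat=length):
        p = word_probability("".join(letters))
        if p > 0.0:
            entropy -= p * log2(p)
    return entropy


def markov_gaps(max_order):
    """h_n - h_mu for n = 0..max_order, with h_n = H(n+1) - H(n)."""
    blocks = [block_entropy(length) for length in range(max_order + 2)]
    return [blocks[n + 1] - blocks[n] - H_MU for n in range(max_order + 1)]


gaps = markov_gaps(8)
for order, gap in enumerate(gaps):
    print(order, round(gap, 4))
```

For this source the gaps shrink as the order grows but stay strictly positive at every finite order, which is the sense in which a sofic system is only a limit of block-Markovian approximations.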

The hierarchical-scaling-complexity approach has several
advantages, particularly its ability to handle issues
of scaling in a natural way (see Ref. [123, sec. 9.5]).
Nonetheless, it does not attain all the goals set in
Sec. II F. Its Markovian predictors are so many black
boxes, saying little or nothing about the hidden states
of the process, their causal connections, or the intrinsic
computation carried on by the process. All of these
properties, as we have shown, are manifest from the $\epsilon$-machine.
We suggest that a productive line of future
work would be to investigate the relationship between
hierarchical scaling complexity and computational mechanics,
and to see whether they can be synthesized.
Along these lines, hierarchical scaling complexity reminds
us somewhat of hierarchical $\epsilon$-machine reconstruction described
in Ref. [?].

9. Continuous Dynamical Computing

Using dynamical systems as computers has become increasingly
attractive over the last ten years or so among
physicists, computer scientists, and others exploring the
physical basis of computation [125-128]. These proposals
have ranged from highly abstract ideas about how to
embed Turing machines in discrete-time nonlinear continuous
maps [?, 129] to, more recently, schemes for specialized
numerical computation that could in principle
be implemented in current hardware [130]. All of them,
however, have been synthetic, in the sense that they concern
designing dynamical systems that implement a given
desired computation or family of computations. In contrast,
one of the central questions of computational mechanics
is exactly the converse: given a dynamical system,
how can one detect what it is intrinsically computing?

We believe that having a mathematical basis and a
set of tools for answering this question is important to
the synthetic, engineering approach to dynamical com- 
puting. Using these tools we may be able to discover, for 
example, novel forms of computation embedded in nat- 
ural processes that operate at higher speeds, with less 
energy, and with fewer physical degrees of freedom than 
currently possible. 

Description

Object in which we wish to find a pattern
Pattern in O
Countable alphabet
Bi-infinite, stationary, discrete stochastic process on A
Particular realization of $\overleftrightarrow{S}$
Random variable for the next $L$ values of $\overleftrightarrow{S}$
Particular value of $\overrightarrow{S}^L$
Next observable generated by $\overleftrightarrow{S}$
As $\overrightarrow{S}^L$ but for the last $L$ values, up to the present
Particular value of $\overleftarrow{S}^L$
Semi-infinite future half of $\overleftrightarrow{S}$
Particular value of $\overrightarrow{S}$
Semi-infinite past half of $\overleftrightarrow{S}$
Particular value of $\overleftarrow{S}$
Set of all pasts realized by the process $\overleftrightarrow{S}$
Partition of $\overleftarrow{\mathbf{S}}$ into effective states
Member-class of $\mathcal{R}$; a particular effective state
Function from $\overleftarrow{\mathbf{S}}$ to $\mathcal{R}$
Current effective ($\eta$) state, as a random variable
Next effective state, as a random variable
Entropy of the random variable X
Joint entropy of the random variables X and Y
Entropy of X conditioned on Y
Mutual information of X and Y
Entropy rate of $\overleftrightarrow{S}$
Entropy rate of $\overleftrightarrow{S}$ conditioned on X
Statistical complexity of $\mathcal{R}$
Set of the causal states of $\overleftrightarrow{S}$
Particular causal state
Function from histories to causal states
Current causal state, as a random variable
Relation of causal equivalence between two histories
Probability of going from causal state i to j, emitting s
Set of prescient rival states
Particular prescient rival state
Next prescient rival state, as a random variable
Statistical complexity of the process O